autovacuum stress-testing our system
Hi,
I've been struggling with autovacuum generating a lot of I/O and CPU on some of our systems - after a night spent analyzing this behavior, I believe the current autovacuum accidentally behaves a bit like a stress-test in some corner cases (but I may be seriously wrong, after all it was a long night).
First - our system really is not a "common" one - we have ~1000 databases of various sizes, each containing up to several thousand tables (a few user-defined tables, the rest serving as caches for a reporting application - yes, it's a bit of a weird design but that's life). This all leads to a pgstat.stat significantly larger than 60 MB.
Now, the two main pieces of information from pgstat.c are the timer definitions
---------------------------------- pgstat.c : 80 ----------------------------------
#define PGSTAT_STAT_INTERVAL    500     /* Minimum time between stats file
                                         * updates; in milliseconds. */

#define PGSTAT_RETRY_DELAY      10      /* How long to wait between checks for
                                         * a new file; in milliseconds. */

#define PGSTAT_MAX_WAIT_TIME    10000   /* Maximum time to wait for a stats
                                         * file update; in milliseconds. */

#define PGSTAT_INQ_INTERVAL     640     /* How often to ping the collector for
                                         * a new file; in milliseconds. */

#define PGSTAT_RESTART_INTERVAL 60      /* How often to attempt to restart a
                                         * failed statistics collector; in
                                         * seconds. */

#define PGSTAT_POLL_LOOP_COUNT  (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
#define PGSTAT_INQ_LOOP_COUNT   (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-----------------------------------------------------------------------------------
and then this loop (the current HEAD does this a bit differently, but the 9.2 code is a bit more readable and suffers from the same issue):
---------------------------------- pgstat.c : 3560 --------------------------------
    /*
     * Loop until fresh enough stats file is available or we ran out of time.
     * The stats inquiry message is sent repeatedly in case collector drops
     * it; but not every single time, as that just swamps the collector.
     */
    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
    {
        TimestampTz file_ts = 0;

        CHECK_FOR_INTERRUPTS();

        if (pgstat_read_statsfile_timestamp(false, &file_ts) &&
            file_ts >= min_ts)
            break;

        /* Not there or too old, so kick the collector and wait a bit */
        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
            pgstat_send_inquiry(min_ts);

        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
    }

    if (count >= PGSTAT_POLL_LOOP_COUNT)
        elog(WARNING, "pgstat wait timeout");

    /* Autovacuum launcher wants stats about all databases */
    if (IsAutoVacuumLauncherProcess())
        pgStatDBHash = pgstat_read_statsfile(InvalidOid, false);
    else
        pgStatDBHash = pgstat_read_statsfile(MyDatabaseId, false);
-----------------------------------------------------------------------------------
What this code does is check the statfile, and if it's not stale (the timestamp of the write start is not older than PGSTAT_RETRY_DELAY milliseconds), the loop is terminated and the file is read.
Now, let's suppose the write takes >10 ms, which is the PGSTAT_RETRY_DELAY value. With our current pgstat.stat filesize/number of relations, this is quite common.
Actually, the common write time in our case is ~100 ms, even if we move the file into tmpfs. That means that almost all the calls to backend_read_statsfile (which happen in all pgstat_fetch_stat_*entry calls) result in a continuous stream of inquiries from the autovacuum workers, and continuous writing/reading of the file.
We're not getting 'pgstat wait timeout' though, because the file finally gets written before PGSTAT_MAX_WAIT_TIME elapses.
By moving the file to a tmpfs we've minimized the I/O impact, but now the collector and autovacuum launcher consume ~75% of CPU (i.e. ~one core) and do nothing except burn power, because the database is almost read-only. Not a good thing in the "green computing" era, I guess.
First, I'm interested in feedback - did I get all the details right, or am I missing something important?
Next, I'm thinking about ways to solve this:
1) turning off autovacuum, doing regular VACUUM ANALYZE from cron - certainly an
   option, but it's rather a workaround than a solution and I'm not very fond of
   it. Moreover, it fixes only one side of the problem - triggering the statfile
   writes over and over. The file will be written anyway, although not as
   frequently.
2) tweaking the timer values, especially increasing PGSTAT_RETRY_DELAY and so on,
   to consider several seconds to be fresh enough - it would be nice to have
   these as GUC variables, although we can do another private patch on our own.
   But more knobs are not always better.
3) logic detecting the proper PGSTAT_RETRY_DELAY value - based mostly on the
   time it takes to write the file (e.g. 10x the write time or something).
4) keeping some sort of "dirty flag" in stat entries - and then writing only info
   about objects that were modified enough to be eligible for vacuum/analyze
   (e.g. an increasing number of index scans can't trigger autovacuum, while
   inserting rows can). Also, I'm not worried about getting a slightly older
   number of index scans, so 'clean' records might be written less frequently
   than 'dirty' ones.
5) splitting the single stat file into multiple pieces - e.g. per database,
   written separately, so that the autovacuum workers don't need to read all
   the data even for databases that don't need to be vacuumed. This might be
   combined with (4).
Ideas? Objections? Preferred options?
I kinda like (4+5), although that'd be a pretty big patch and I'm not entirely sure it can be done without breaking other things.
regards
Tomas
Really, as far as autovacuum is concerned, it would be much more useful
to be able to reliably detect that a table has been recently vacuumed,
without having to request a 10ms-recent pgstat snapshot. That would
greatly reduce the amount of time autovac spends on pgstat requests.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Sep 26, 2012 at 5:43 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
First - our system really is not a "common" one - we do have ~1000 of databases of various size, each containing up to several thousands of tables (several user-defined tables, the rest serve as caches for a reporting application - yes, it's a bit weird design but that's life). This all leads to pgstat.stat significantly larger than 60 MB.
...
Now, let's suppose the write takes >10 ms, which is the PGSTAT_RETRY_DELAY values. With our current pgstat.stat filesize/num of relations, this is quite common. Actually the common write time in our case is ~100 ms, even if we move the file into tmpfs. That means that almost all the calls to backend_read_statsfile (which happen in all pgstat_fetch_stat_*entry calls) result in continuous stream of inquiries from the autovacuum workers, writing/reading of the file.
I don't think it actually does. What you are missing is the same thing I was missing a few weeks ago when I also looked into something like this.
3962:

 * We don't recompute min_ts after sleeping, except in the
 * unlikely case that cur_ts went backwards.

That means the file must have been written within 10 ms of when we *first* asked for it.
What is generating the endless stream you are seeing is that you have 1000 databases, so if naptime is one minute you are vacuuming 16 per second. Since every database gets a new process, that process needs to read the file, as it doesn't inherit one.
...
First, I'm interested in feedback - did I get all the details right, or am I missing something important?

Next, I'm thinking about ways to solve this:
1) turning of autovacuum, doing regular VACUUM ANALYZE from cron
Increasing autovacuum_naptime seems like a far better way to do effectively the same thing.
2) tweaking the timer values, especially increasing PGSTAT_RETRY_DELAY and so on
   to consider several seconds to be fresh enough - Would be nice to have this
   as a GUC variables, although we can do another private patch on our own. But
   more knobs is not always better.
I think forking it off to another value would be better. If you are an autovacuum worker which is just starting up and so getting its initial stats, you can tolerate a stats file up to "autovacuum_naptime / 5.0" stale. If you are already started up and are just about to vacuum a table, then keep the staleness at PGSTAT_RETRY_DELAY as it currently is, so as not to redundantly vacuum a table.
3) logic detecting the proper PGSTAT_RETRY_DELAY value - based mostly on the
   time it takes to write the file (e.g. 10x the write time or something).
This is already in place.
5) splitting the single stat file into multiple pieces - e.g. per database,
   written separately, so that the autovacuum workers don't need to read all
   the data even for databases that don't need to be vacuumed. This might be
   combined with (4).
I think this needs to happen eventually.
Cheers,
Jeff
On 26-09-2012 09:43, Tomas Vondra wrote:
I've been struggling with autovacuum generating a lot of I/O and CPU on some of our systems - after a night spent analyzing this behavior, I believe the current autovacuum accidentally behaves a bit like a stress-test in some corner cases (but I may be seriously wrong, after all it was a long night).
It is known that the statistics collector doesn't scale to a lot of databases. It wouldn't be a problem if we didn't have automatic maintenance (aka autovacuum).
Next, I'm thinking about ways to solve this:
1) turning of autovacuum, doing regular VACUUM ANALYZE from cron - certainly an
   option, but it's rather a workaround than a solution and I'm not very fond of
   it. Moreover it fixes only one side of the problem - triggering the statfile
   writes over and over. The file will be written anyway, although not that
   frequently.
It solves your problem if you combine scheduled VA with pgstat.stat in a tmpfs. I don't see it as a definitive solution if we want to scale auto maintenance to several hundreds or even thousands of databases in a single cluster. (Someone could think it is not that common, but in hosting scenarios it is. DBAs don't want to run several VMs or pg servers just to minimize the auto maintenance scalability problem.)
2) tweaking the timer values, especially increasing PGSTAT_RETRY_DELAY and so on
   to consider several seconds to be fresh enough - Would be nice to have this
   as a GUC variables, although we can do another private patch on our own. But
   more knobs is not always better.
It doesn't solve the problem. Also, it could be a problem for autovacuum (which makes assumptions based on those fixed values).
3) logic detecting the proper PGSTAT_RETRY_DELAY value - based mostly on the time
   it takes to write the file (e.g. 10x the write time or something).
Such adaptive logic would be good only if it takes a small fraction of time to execute. It has to pay attention to the limits. It appears to be a candidate for exploration.
4) keeping some sort of "dirty flag" in stat entries - and then writing only info
   about objects were modified enough to be eligible for vacuum/analyze (e.g.
   increasing number of index scans can't trigger autovacuum while inserting
   rows can). Also, I'm not worried about getting a bit older num of index
   scans, so 'clean' records might be written less frequently than 'dirty' ones.
It minimizes your problem but harms monitoring tools (which want fresh statistics about databases).
5) splitting the single stat file into multiple pieces - e.g. per database,
   written separately, so that the autovacuum workers don't need to read all
   the data even for databases that don't need to be vacuumed. This might be
   combined with (4).
IMHO that's the definitive solution. It would be one file per database plus a global one. That way, the check would only read global.stat and process those databases that were modified. Also, an in-memory map could store that information to speed up the checks. The only downside I can see is that you will increase the number of open file descriptors.
Ideas? Objections? Preferred options?
I prefer to attack 3, sort of 4 (explained in 5 -- the in-memory map) and 5.
Out of curiosity, did you run perf (or some other performance analyzer) to verify whether some (stats and/or autovac) functions pop up in the report?
--
Euler Taveira de Oliveira - Timbira http://www.timbira.com.br/
PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento
On 26.09.2012 16:51, Jeff Janes wrote:
On Wed, Sep 26, 2012 at 5:43 AM, Tomas Vondra <tv@fuzzy.cz> wrote:

First - our system really is not a "common" one - we do have ~1000 of databases of various size, each containing up to several thousands of tables (several user-defined tables, the rest serve as caches for a reporting application - yes, it's a bit weird design but that's life). This all leads to pgstat.stat significantly larger than 60 MB.
...
Now, let's suppose the write takes >10 ms, which is the PGSTAT_RETRY_DELAY values. With our current pgstat.stat filesize/num of relations, this is quite common. Actually the common write time in our case is ~100 ms, even if we move the file into tmpfs. That means that almost all the calls to backend_read_statsfile (which happen in all pgstat_fetch_stat_*entry calls) result in continuous stream of inquiries from the autovacuum workers, writing/reading of the file.

I don't think it actually does. What you are missing is the same thing I was missing a few weeks ago when I also looked into something like this.

3962:

 * We don't recompute min_ts after sleeping, except in the
 * unlikely case that cur_ts went backwards.

That means the file must have been written within 10 ms of when we *first* asked for it.
Yeah, right - I've missed the first "if (pgStatDBHash)" check right at the beginning.
What is generating the endless stream you are seeing is that you have
1000 databases so if naptime is one minute you are vacuuming 16 per
second. Since every database gets a new process, that process needs
to read the file as it doesn't inherit one.
Right. But that makes the 10ms timeout even more strange, because the worker is then using the data for a very long time (even minutes).
...
First, I'm interested in feedback - did I get all the details right, or am I missing something important?

Next, I'm thinking about ways to solve this:
1) turning of autovacuum, doing regular VACUUM ANALYZE from cron
Increasing autovacuum_naptime seems like a far better way to do
effectively the same thing.
Agreed. One of my colleagues turned autovacuum off a few years back, and that was a nice lesson in how not to solve this kind of issue.
2) tweaking the timer values, especially increasing PGSTAT_RETRY_DELAY and so on
   to consider several seconds to be fresh enough - Would be nice to have this
   as a GUC variables, although we can do another private patch on our own. But
   more knobs is not always better.

I think forking it off to another value would be better. If you are an autovacuum worker which is just starting up and so getting its initial stats, you can tolerate a stats file up to "autovacuum_naptime / 5.0" stale. If you are already started up and are just about to vacuum a table, then keep the staleness at PGSTAT_RETRY_DELAY as it currently is, so as not to redundantly vacuum a table.
I always thought there's a "no more than one worker per database" limit, and that the file is always reloaded when switching to another database. So I'm not sure how a worker could see such stale table info? Or are the workers keeping the stats across multiple databases?
3) logic detecting the proper PGSTAT_RETRY_DELAY value - based mostly on the
   time it takes to write the file (e.g. 10x the write time or something).

This is already in place.
Really? Where?
I've checked the current master, and the only thing I see in pgstat_write_statsfile is this (line 3558):

last_statwrite = globalStats.stats_timestamp;

https://github.com/postgres/postgres/blob/master/src/backend/postmaster/pgstat.c#L3558
I don't think that's doing what I meant. That really doesn't scale the timeout according to write time. What happens right now is that when the stats file is written at time 0 (starts at zero, write finishes at 100 ms), and a worker asks for the file at 99 ms (i.e. 1 ms before the write finishes), it will set the time of the inquiry to last_statrequest and then do this

if (last_statwrite < last_statrequest)
    pgstat_write_statsfile(false);

i.e. comparing it to the start of the write. So another write will start right after the file is written out. And over and over.
Moreover, there's the 'rename' step making the new file invisible to the worker processes, which makes the thing a bit more complicated.
What I'm suggesting is that there should be some sort of tracking of the write time, and then deciding whether the file is fresh enough using 10x that value. So when a file is written in 100 ms, it'd be considered OK for the next 900 ms, i.e. 1 sec in total. Sure, we could use 5x or some other coefficient, it doesn't really matter.
5) splitting the single stat file into multiple pieces - e.g. per database,
   written separately, so that the autovacuum workers don't need to read all
   the data even for databases that don't need to be vacuumed. This might be
   combined with (4).

I think this needs to happen eventually.
Yes, a nice patch idea ;-)
thanks for the feedback
Tomas
Excerpts from Euler Taveira's message of mié sep 26 11:53:27 -0300 2012:
On 26-09-2012 09:43, Tomas Vondra wrote:
5) splitting the single stat file into multiple pieces - e.g. per database,
   written separately, so that the autovacuum workers don't need to read all
   the data even for databases that don't need to be vacuumed. This might be
   combined with (4).

IMHO that's the definitive solution. It would be one file per database plus a global one. That way, the check would only read the global.stat and process those database that were modified. Also, an in-memory map could store that information to speed up the checks.
+1
The only downside I can see is that you will increase the number of opened file descriptors.
Note that most users of pgstat will only have two files open (instead of one, as currently) -- one for shared, one for their own database. Only pgstat itself and the autovac launcher would need to open pgstat files for all databases; but neither has a need to open other files (arbitrary tables), so this shouldn't be a major problem.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Excerpts from Tomas Vondra's message of mié sep 26 12:25:58 -0300 2012:
On 26.09.2012 16:51, Jeff Janes wrote:
I think forking it off to another value would be better. If you are an autovacuum worker which is just starting up and so getting its initial stats, you can tolerate a stats file up to "autovacuum_naptime / 5.0" stale. If you are already started up and are just about to vacuum a table, then keep the staleness at PGSTAT_RETRY_DELAY as it currently is, so as not to redundantly vacuum a table.

I always thought there's a "no more than one worker per database" limit,
There is no such limitation.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 26.09.2012 17:29, Alvaro Herrera wrote:
Excerpts from Tomas Vondra's message of mié sep 26 12:25:58 -0300 2012:

On 26.09.2012 16:51, Jeff Janes wrote:

I think forking it off to another value would be better. If you are an autovacuum worker which is just starting up and so getting its initial stats, you can tolerate a stats file up to "autovacuum_naptime / 5.0" stale. If you are already started up and are just about to vacuum a table, then keep the staleness at PGSTAT_RETRY_DELAY as it currently is, so as not to redundantly vacuum a table.

I always thought there's a "no more than one worker per database" limit,

There is no such limitation.
OK, thanks. Still, reading/writing the small (per-database) files would be much faster, so it would be easy to read/write them more often on demand.
Tomas
On Wed, Sep 26, 2012 at 8:25 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
On 26.09.2012 16:51, Jeff Janes wrote:
What is generating the endless stream you are seeing is that you have 1000 databases so if naptime is one minute you are vacuuming 16 per second. Since every database gets a new process, that process needs to read the file as it doesn't inherit one.

Right. But that makes the 10ms timeout even more strange, because the worker is then using the data for very long time (even minutes).
On average that can't happen, or else your vacuuming would fall way behind. But I agree, there is no reason to have very fresh statistics to start with. naptime/5 seems like a good cutoff to me for the start-up reading. If a table only becomes eligible for vacuuming in the last 20% of the naptime, I see no reason it can't wait another round. But that just means the statistics collector needs to write the file less often; the workers still need to read it once per database, since each one only vacuums one database and doesn't inherit the data from the launcher.
I think forking it off to another value would be better. If you are an autovacuum worker which is just starting up and so getting its initial stats, you can tolerate a stats file up to "autovacuum_naptime / 5.0" stale. If you are already started up and are just about to vacuum a table, then keep the staleness at PGSTAT_RETRY_DELAY as it currently is, so as not to redundantly vacuum a table.

I always thought there's a "no more than one worker per database" limit, and that the file is always reloaded when switching to another database. So I'm not sure how could a worker see such a stale table info? Or are the workers keeping the stats across multiple databases?
If you only have one "active" database, then all the workers will be in it. I don't know how likely it is that they will leapfrog each other and collide. But anyway, if you have 1000s of databases, then each one will generally require zero vacuums per naptime (as you say, it is mostly read-only), so it is the reads upon start-up, not the reads per table that needs vacuuming, which generate most of the traffic. Once you separate those two parameters out, playing around with the PGSTAT_RETRY_DELAY one seems like a needless risk.
3) logic detecting the proper PGSTAT_RETRY_DELAY value - based mostly on the
   time it takes to write the file (e.g. 10x the write time or something).

This is already in place.
Really? Where?
I had thought that this part was effectively the same thing:

 * We don't recompute min_ts after sleeping, except in the
 * unlikely case that cur_ts went backwards.

But I think I did not understand your proposal.
I've checked the current master, and the only thing I see in pgstat_write_statsfile is this (line 3558):

last_statwrite = globalStats.stats_timestamp;

https://github.com/postgres/postgres/blob/master/src/backend/postmaster/pgstat.c#L3558

I don't think that's doing what I meant. That really doesn't scale the timeout according to write time. What happens right now is that when the stats file is written at time 0 (starts at zero, write finishes at 100 ms), and a worker asks for the file at 99 ms (i.e. 1ms before the write finishes), it will set the time of the inquiry to last_statrequest and then do this

if (last_statwrite < last_statrequest)
    pgstat_write_statsfile(false);

i.e. comparing it to the start of the write. So another write will start right after the file is written out. And over and over.
Ah. I had wondered about this before too, and wondered if it would be a good idea to have it go back to the beginning of the stats file, and overwrite the timestamp with the current time (rather than the time it started writing it), as the last action it does before the rename. I think that would automatically make it adaptive to the time it takes to write out the file, in a fairly simple way.
Moreover there's the 'rename' step making the new file invisible for the worker processes, which makes the thing a bit more complicated.

I think renames are assumed to be atomic. Either it sees the old one, or the new one, but never sees neither.
Cheers,
Jeff
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
Excerpts from Euler Taveira's message of mié sep 26 11:53:27 -0300 2012:
On 26-09-2012 09:43, Tomas Vondra wrote:
5) splitting the single stat file into multiple pieces - e.g. per database,
   written separately, so that the autovacuum workers don't need to read all
   the data even for databases that don't need to be vacuumed. This might be
   combined with (4).

IMHO that's the definitive solution. It would be one file per database plus a global one. That way, the check would only read the global.stat and process those database that were modified. Also, an in-memory map could store that information to speed up the checks.
+1
That would help for the case of hundreds of databases, but how much does it help for lots of tables in a single database?
I'm a bit suspicious of the idea that we should encourage people to use
hundreds of databases per installation anyway: the duplicated system
catalogs are going to be mighty expensive, both in disk space and in
their cache footprint in shared buffers. There was some speculation
at the last PGCon about how we might avoid the duplication, but I think
we're years away from any such thing actually happening.
What seems to me like it could help more is fixing things so that the
autovac launcher needn't even launch a child process for databases that
haven't had any updates lately. I'm not sure how to do that, but it
probably involves getting the stats collector to produce some kind of
summary file.
regards, tom lane
On Wed, Sep 26, 2012 at 9:29 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:

Excerpts from Euler Taveira's message of mié sep 26 11:53:27 -0300 2012:

On 26-09-2012 09:43, Tomas Vondra wrote:

5) splitting the single stat file into multiple pieces - e.g. per database,
   written separately, so that the autovacuum workers don't need to read all
   the data even for databases that don't need to be vacuumed. This might be
   combined with (4).

IMHO that's the definitive solution. It would be one file per database plus a global one. That way, the check would only read the global.stat and process those database that were modified. Also, an in-memory map could store that information to speed up the checks.

+1
That would help for the case of hundreds of databases, but how much does it help for lots of tables in a single database?
It doesn't help that case, but that case doesn't need much help. If you have N statistics-kept objects in total spread over M databases, of which T objects need vacuuming per naptime, the stats file traffic is proportional to N*(M+T). If T is low, then there is generally no problem if M is also low. Or at least, the problem is much smaller than when M is high for a fixed value of N.
I'm a bit suspicious of the idea that we should encourage people to use hundreds of databases per installation anyway:
I agree with that, but we could still do a better job of tolerating it, without encouraging it. If someone volunteers to write the code to do this, what trade-offs would there be?
Cheers,
Jeff
On 26.9.2012 18:29, Tom Lane wrote:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
Excerpts from Euler Taveira's message of mié sep 26 11:53:27 -0300 2012:
On 26-09-2012 09:43, Tomas Vondra wrote:

5) splitting the single stat file into multiple pieces - e.g. per database,
   written separately, so that the autovacuum workers don't need to read all
   the data even for databases that don't need to be vacuumed. This might be
   combined with (4).

IMHO that's the definitive solution. It would be one file per database plus a global one. That way, the check would only read the global.stat and process those database that were modified. Also, an in-memory map could store that information to speed up the checks.

+1
That would help for the case of hundreds of databases, but how much does it help for lots of tables in a single database?
Well, it wouldn't, but it wouldn't make it worse either. Or at least that's how I understand it.
I'm a bit suspicious of the idea that we should encourage people to use
hundreds of databases per installation anyway: the duplicated system
catalogs are going to be mighty expensive, both in disk space and in
their cache footprint in shared buffers. There was some speculation
at the last PGCon about how we might avoid the duplication, but I think
we're years away from any such thing actually happening.
You don't need to encourage us to do that ;-) We know it's not perfect and are considering a better alternative - e.g. several databases (~10) with schemas inside, replacing the current database-only approach. This way we'd get multiple stat files (thus gaining the benefits) with less overhead (shared catalogs).
And yes, using tens of thousands of tables (serving as "caches") for a reporting solution is "interesting" (as in the old Chinese curse) too.
What seems to me like it could help more is fixing things so that the
autovac launcher needn't even launch a child process for databases that
haven't had any updates lately. I'm not sure how to do that, but it
probably involves getting the stats collector to produce some kind of
summary file.
Yes, I've proposed something like this in my original mail - setting a
"dirty" flag on objects (a database in this case) whenever a table in it
gets eligible for vacuum/analyze.
Tomas
On 26.9.2012 18:14, Jeff Janes wrote:
On Wed, Sep 26, 2012 at 8:25 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
On 26.09.2012 16:51, Jeff Janes wrote:
What is generating the endless stream you are seeing is that you have 1000 databases so if naptime is one minute you are vacuuming 16 per second. Since every database gets a new process, that process needs to read the file as it doesn't inherit one.

Right. But that makes the 10ms timeout even more strange, because the worker is then using the data for very long time (even minutes).

On average that can't happen, or else your vacuuming would fall way behind. But I agree, there is no reason to have very fresh statistics to start with. naptime/5 seems like a good cutoff for me for the start up reading. If a table only becomes eligible for vacuuming in the last 20% of the naptime, I see no reason that it can't wait another round. But that just means the statistics collector needs to write the file less often, the workers still need to read it once per database since each one only vacuums one database and don't inherit the data from the launcher.
So what happens if there are two workers vacuuming the same database? Wouldn't that make it more probable that the tables were already vacuumed by the other worker?
See the comment at the beginning of autovacuum.c, where it also states that the statfile is reloaded before each table (probably because of the calls to autovac_refresh_stats, which in turn calls clear_snapshot).
I think forking it off to another value would be better. If you are an autovacuum worker which is just starting up and so getting its initial stats, you can tolerate a stats file up to "autovacuum_naptime / 5.0" stale. If you are already started up and are just about to vacuum a table, then keep the staleness at PGSTAT_RETRY_DELAY as it currently is, so as not to redundantly vacuum a table.

I always thought there's a "no more than one worker per database" limit, and that the file is always reloaded when switching to another database. So I'm not sure how could a worker see such a stale table info? Or are the workers keeping the stats across multiple databases?

If you only have one "active" database, then all the workers will be in it. I don't know how likely it is that they will leapfrog each other and collide. But anyway, if you have 1000s of databases, then each one will generally require zero vacuums per naptime (as you say, it is mostly read-only), so it is the reads upon start-up, not the reads per table that needs vacuuming, which generate most of the traffic. Once you separate those two parameters out, playing around with the PGSTAT_RETRY_DELAY one seems like a needless risk.
OK, right. My fault.
Yes, our databases are mostly read-only - more precisely, whenever we
load data, we immediately do VACUUM ANALYZE on the tables, so autovacuum
never kicks in on them. The only thing it works on are the system
catalogs and such.
3) logic detecting the proper PGSTAT_RETRY_DELAY value - based mostly
on the time it takes to write the file (e.g. 10x the write time or
something).

This is already in place.
Really? Where?
I had thought that this part was effectively the same thing:

    * We don't recompute min_ts after sleeping, except in the
    * unlikely case that cur_ts went backwards.

But I think I did not understand your proposal.
I've checked the current master, and the only thing I see in
pgstat_write_statsfile is this (line 3558):

    last_statwrite = globalStats.stats_timestamp;

https://github.com/postgres/postgres/blob/master/src/backend/postmaster/pgstat.c#L3558
I don't think that's doing what I meant. That really doesn't scale the
timeout according to write time. What happens right now is that when the
stats file is written at time 0 (the write starts at zero and finishes
at 100 ms), and a worker asks for the file at 99 ms (i.e. 1 ms before
the write finishes), it will set the time of the inquiry to
last_statrequest and then do this

    if (last_statwrite < last_statrequest)
        pgstat_write_statsfile(false);

i.e. comparing it to the start of the write. So another write will start
right after the file is written out. And over and over.

Ah. I had wondered about this before too, and wondered if it would be
a good idea to have it go back to the beginning of the stats file, and
overwrite the timestamp with the current time (rather than the time it
started writing it), as the last action it does before the rename. I
think that would automatically make it adaptive to the time it takes
to write out the file, in a fairly simple way.
Yeah, I was thinking about that too.
Moreover there's the 'rename' step making the new file invisible to the
worker processes, which makes the thing a bit more complicated.

I think renames are assumed to be atomic. Either it sees the old one,
or the new one, but never neither.
I'm not quite sure what I meant, but not this - I know the renames are
atomic. I probably hadn't noticed that inquiries are using min_ts, so I
thought that an inquiry sent right after the write starts (with min_ts
before the write) would trigger another write, but that's not the case.
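For what it's worth, the "stamp the completion time just before the rename" idea could look roughly like this in plain stdio terms (a simplified sketch with made-up names, not the actual pgstat code):

```c
#include <stdio.h>
#include <time.h>

/* Sketch: write the payload to a temp file, then seek back and
 * overwrite the leading timestamp with the time the write *finished*,
 * so readers comparing their request time against the file timestamp
 * see the completion time, not the start time.  rename() then
 * atomically replaces the old file. */
static int
write_stats_with_completion_ts(const char *tmpfile, const char *statfile,
                               const char *payload, size_t len)
{
    FILE   *fp = fopen(tmpfile, "wb");
    time_t  ts = 0;

    if (fp == NULL)
        return -1;

    /* placeholder timestamp, then the payload */
    if (fwrite(&ts, sizeof(ts), 1, fp) != 1 ||
        fwrite(payload, 1, len, fp) != len)
        goto fail;

    /* last action before close/rename: stamp the completion time */
    ts = time(NULL);
    if (fseek(fp, 0L, SEEK_SET) != 0 ||
        fwrite(&ts, sizeof(ts), 1, fp) != 1)
        goto fail;

    if (fclose(fp) != 0)
    {
        remove(tmpfile);
        return -1;
    }
    return rename(tmpfile, statfile);

fail:
    fclose(fp);
    remove(tmpfile);
    return -1;
}
```

This makes the effective write timestamp automatically adaptive to how long the write takes, which is what the discussion above is after.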
regards
Tomas
On 26 September 2012 15:47, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Really, as far as autovacuum is concerned, it would be much more useful
to be able to reliably detect that a table has been recently vacuumed,
without having to request a 10ms-recent pgstat snapshot. That would
greatly reduce the amount of time autovac spends on pgstat requests.
VACUUMing generates a relcache invalidation. Can we arrange for those
invalidations to be received by autovac launcher, so it gets immediate
feedback of recent activity without polling?
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Excerpts from Simon Riggs's message of jue sep 27 06:51:28 -0300 2012:
On 26 September 2012 15:47, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Really, as far as autovacuum is concerned, it would be much more useful
to be able to reliably detect that a table has been recently vacuumed,
without having to request a 10ms-recent pgstat snapshot. That would
greatly reduce the amount of time autovac spends on pgstat requests.

VACUUMing generates a relcache invalidation. Can we arrange for those
invalidations to be received by autovac launcher, so it gets immediate
feedback of recent activity without polling?
Hmm, this is an interesting idea worth exploring, I think. Maybe we
should sort tables in the autovac worker to-do list by age of last
invalidation messages received, or something like that. Totally unclear
on the details, but as I said, worth exploring.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 27 September 2012 15:57, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Excerpts from Simon Riggs's message of jue sep 27 06:51:28 -0300 2012:
On 26 September 2012 15:47, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Really, as far as autovacuum is concerned, it would be much more useful
to be able to reliably detect that a table has been recently vacuumed,
without having to request a 10ms-recent pgstat snapshot. That would
greatly reduce the amount of time autovac spends on pgstat requests.

VACUUMing generates a relcache invalidation. Can we arrange for those
invalidations to be received by autovac launcher, so it gets immediate
feedback of recent activity without polling?

Hmm, this is an interesting idea worth exploring, I think. Maybe we
should sort tables in the autovac worker to-do list by age of last
invalidation messages received, or something like that. Totally unclear
on the details, but as I said, worth exploring.
Just put them at the back of the queue if an inval is received.

There is already support for listening to relcache inval messages
without ever generating them.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi!
On 26.9.2012 19:18, Jeff Janes wrote:
On Wed, Sep 26, 2012 at 9:29 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
Excerpts from Euler Taveira's message of mié sep 26 11:53:27 -0300 2012:
On 26-09-2012 09:43, Tomas Vondra wrote:
5) splitting the single stat file into multiple pieces - e.g. per database,
written separately, so that the autovacuum workers don't need to read all
the data even for databases that don't need to be vacuumed. This might be
combined with (4).

IMHO that's the definitive solution. It would be one file per database
plus a global one. That way, the check would only read the global.stat
and process those databases that were modified. Also, an in-memory map
could store that information to speed up the checks.

+1
That would help for the case of hundreds of databases, but how much
does it help for lots of tables in a single database?

It doesn't help that case, but that case doesn't need much help. If
you have N statistics-kept objects in total spread over M databases,
of which T objects need vacuuming per naptime, the stats file traffic
is proportional to N*(M+T). If T is low, then there is generally
no problem if M is also low. Or at least, the problem is much smaller
than when M is high for a fixed value of N.
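To make the N*(M+T) proportionality concrete, here is a toy calculation (numbers roughly matching the scenario in this thread, not measurements):

```c
/* Stats file traffic is proportional to N*(M+T): each of the M
 * databases triggers a full read at worker start-up, and each of the
 * T per-naptime table vacuums triggers another read.  With the same
 * N = 1,000,000 objects, 1000 mostly-idle databases generate ~100x
 * the traffic of a single database holding all the objects. */
static long long
stats_traffic(long long n, long long m, long long t)
{
    return n * (m + t);
}
```

So for N = 1,000,000 and T = 10, M = 1000 gives 1,010,000,000 units of traffic versus 11,000,000 for M = 1 - which is why splitting the file per database attacks the dominant term.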
I've done some initial hacking on splitting the stat file into multiple
smaller pieces over the weekend, and it seems promising (at least with
respect to the issues we're having).
See the patch attached, but be aware that this is a very early WIP (or
rather a proof of concept), so it has many rough edges (read "sloppy
coding"). I haven't even added it to the commitfest yet ...
The two main changes are these:
(1) The stats file is split into a common "db" file, containing all the
DB Entries, and per-database files with tables/functions. The common
file is still called "pgstat.stat", the per-db files have the
database OID appended, so for example "pgstat.stat.12345" etc.
This was a trivial hack of the
pgstat_read_statsfile/pgstat_write_statsfile functions, introducing two
new functions:

pgstat_read_db_statsfile
pgstat_write_db_statsfile

that do the trick of reading/writing the stat file for one database.
(2) The pgstat_read_statsfile has an additional parameter "onlydbs" that
says that you don't need table/func stats - just the list of db
entries. This is used for the autovacuum launcher, which does not need
to read the table/func stats (if I'm reading the code in autovacuum.c
correctly - it seems to be working as expected).
So what are the benefits?
(a) When a launcher asks for info about databases, something like this
is called in the end:
pgstat_read_db_statsfile(InvalidOid, false, true)
which means all databases (InvalidOid) and only db info (true). So
it reads only the one common file with db entries, not the
table/func stats.
(b) When a worker asks for stats for a given DB, something like this is
called in the end:
pgstat_read_db_statsfile(MyDatabaseId, false, false)
which reads only the common stats file (with db entries) and only
one file for the one database.
With the current implementation (a single pgstat.stat file), all
the data had to be read (and silently skipped) in both cases.
That's a lot of CPU time, and we're seeing ~60% of CPU spent on
doing just this (writing/reading the huge statsfile).
So with a lot of databases/objects, this "pgstat.stat split" saves
us a lot of CPU ...
(c) This should lower the space requirements too - with a single file,
you actually need at least 2x the disk space (or RAM, if you're
using tmpfs as we are), because you need to keep two versions of
the file at the same time (pgstat.stat and pgstat.tmp).
Thanks to this split you only need additional space for a copy of
the largest piece (with some reasonable safety reserve).
Well, it's very early patch, so there are rough edges too
(a) It does not solve the "many-schema" scenario at all - that'll need
a completely new approach I guess :-(
(b) It does not solve the writing part at all - the current code uses a
single timestamp (last_statwrite) to decide if a new file needs to
be written.
That clearly is not enough for multiple files - there should be one
timestamp for each database/file. I'm thinking about how to solve
this and how to integrate it with pgstat_send_inquiry etc.
One way might be adding the timestamp(s) into PgStat_StatDBEntry
and the other one is using an array of inquiries for each database.
And yet another one I'm thinking about is using a fixed-length
array of timestamps (e.g. 256), indexed by mod(dboid,256). That
would mean stats for all databases with the same mod(oid,256) would
be written at the same time. Seems like over-engineering, though.
(c) I'm a bit worried about the number of files - right now there's one
for each database and I'm thinking about splitting them by type
(one for tables, one for functions) which might make it even faster
for some apps with a lot of stored procedures etc.
But is the large number of files actually a problem? After all,
we're using one file per relation fork in the "base" directory, so
this seems like a minor issue.
And if it really is an issue, this might be solved by the mod(oid,256)
trick to combine multiple files into one (which would work neatly with
the fixed-length array of timestamps).
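The mod(oid,256) reduction mentioned here (and in the FIXMEs in the patch) would boil down to something like this (hypothetical helper; PGSTAT_NUM_SEGMENTS is a made-up name):

```c
#include <stdio.h>

#define PGSTAT_NUM_SEGMENTS 256

/* Map a database OID to one of a fixed number of segment files, so
 * thousands of databases share at most 256 files.  All databases with
 * the same (oid % 256) would then be written out together, which also
 * pairs naturally with a fixed-length array of 256 timestamps. */
static void
stat_segment_path(char *buf, size_t buflen, const char *base,
                  unsigned int dboid)
{
    snprintf(buf, buflen, "%s.%u", base, dboid % PGSTAT_NUM_SEGMENTS);
}
```

For example, databases 12345 and 12601 (both 57 mod 256) would land in the same "pgstat.stat.57" segment.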
kind regards
Tomas
Attachments:
stats-split.patch (text/plain)
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index be3adf1..226311c 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -253,7 +253,9 @@ static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
Oid tableoid, bool create);
static void pgstat_write_statsfile(bool permanent);
-static HTAB *pgstat_read_statsfile(Oid onlydb, bool permanent);
+static void pgstat_write_db_statsfile(PgStat_StatDBEntry * dbentry, bool permanent);
+static HTAB *pgstat_read_statsfile(Oid onlydb, bool permanent, bool onlydbs);
+static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
static void backend_read_statsfile(void);
static void pgstat_read_current_status(void);
@@ -1408,13 +1410,14 @@ pgstat_ping(void)
* ----------
*/
static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time)
+pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
{
PgStat_MsgInquiry msg;
pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
msg.clock_time = clock_time;
msg.cutoff_time = cutoff_time;
+ msg.databaseid = databaseid;
pgstat_send(&msg, sizeof(msg));
}
@@ -3063,7 +3066,7 @@ PgstatCollectorMain(int argc, char *argv[])
* zero.
*/
pgStatRunningInCollector = true;
- pgStatDBHash = pgstat_read_statsfile(InvalidOid, true);
+ pgStatDBHash = pgstat_read_statsfile(InvalidOid, true, false);
/*
* Loop to process messages until we get SIGQUIT or detect ungraceful
@@ -3435,11 +3438,7 @@ static void
pgstat_write_statsfile(bool permanent)
{
HASH_SEQ_STATUS hstat;
- HASH_SEQ_STATUS tstat;
- HASH_SEQ_STATUS fstat;
PgStat_StatDBEntry *dbentry;
- PgStat_StatTabEntry *tabentry;
- PgStat_StatFuncEntry *funcentry;
FILE *fpout;
int32 format_id;
const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
@@ -3493,29 +3492,15 @@ pgstat_write_statsfile(bool permanent)
(void) rc; /* we'll check for error with ferror */
/*
- * Walk through the database's access stats per table.
+ * Write our the tables and functions into a separate file.
*/
- hash_seq_init(&tstat, dbentry->tables);
- while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
- {
- fputc('T', fpout);
- rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
- (void) rc; /* we'll check for error with ferror */
- }
-
- /*
- * Walk through the database's function stats table.
- */
- hash_seq_init(&fstat, dbentry->functions);
- while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
- {
- fputc('F', fpout);
- rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
- (void) rc; /* we'll check for error with ferror */
- }
+ pgstat_write_db_statsfile(dbentry, permanent);
/*
* Mark the end of this DB
+ *
+ * FIXME does it really make much sense, when the tables/functions
+ * are moved to a separate file (using those chars?)
*/
fputc('d', fpout);
}
@@ -3587,6 +3572,111 @@ pgstat_write_statsfile(bool permanent)
}
+
+/* ----------
+ * pgstat_write_statsfile() -
+ *
+ * Tell the news.
+ * If writing to the permanent file (happens when the collector is
+ * shutting down only), remove the temporary file so that backends
+ * starting up under a new postmaster can't read the old data before
+ * the new collector is ready.
+ * ----------
+ */
+static void
+pgstat_write_db_statsfile(PgStat_StatDBEntry * dbentry, bool permanent)
+{
+ HASH_SEQ_STATUS tstat;
+ HASH_SEQ_STATUS fstat;
+ PgStat_StatTabEntry *tabentry;
+ PgStat_StatFuncEntry *funcentry;
+ FILE *fpout;
+ int rc;
+
+ /* FIXME Disgusting. Handle properly ... */
+ const char *tmpfile_x = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
+ const char *statfile_x = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+
+ char tmpfile[255];
+ char statfile[255];
+
+ /* FIXME Do some kind of reduction (e.g. mod(oid,255)) not to end with thousands of files,
+ *one for each database */
+ snprintf(tmpfile, 255, "%s.%d", tmpfile_x, dbentry->databaseid);
+ snprintf(statfile, 255, "%s.%d", statfile_x, dbentry->databaseid);
+
+ /*
+ * Open the statistics temp file to write out the current values.
+ */
+ fpout = AllocateFile(tmpfile, PG_BINARY_W);
+ if (fpout == NULL)
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not open temporary statistics file \"%s\": %m",
+ tmpfile)));
+ return;
+ }
+
+ /*
+ * Walk through the database's access stats per table.
+ */
+ hash_seq_init(&tstat, dbentry->tables);
+ while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
+ {
+ fputc('T', fpout);
+ rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
+ (void) rc; /* we'll check for error with ferror */
+ }
+
+ /*
+ * Walk through the database's function stats table.
+ */
+ hash_seq_init(&fstat, dbentry->functions);
+ while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
+ {
+ fputc('F', fpout);
+ rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
+ (void) rc; /* we'll check for error with ferror */
+ }
+
+ /*
+ * No more output to be done. Close the temp file and replace the old
+ * pgstat.stat with it. The ferror() check replaces testing for error
+ * after each individual fputc or fwrite above.
+ */
+ fputc('E', fpout);
+
+ if (ferror(fpout))
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not write temporary statistics file \"%s\": %m",
+ tmpfile)));
+ FreeFile(fpout);
+ unlink(tmpfile);
+ }
+ else if (FreeFile(fpout) < 0)
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not close temporary statistics file \"%s\": %m",
+ tmpfile)));
+ unlink(tmpfile);
+ }
+ else if (rename(tmpfile, statfile) < 0)
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
+ tmpfile, statfile)));
+ unlink(tmpfile);
+ }
+
+ // if (permanent)
+ // unlink(pgstat_stat_filename);
+}
+
/* ----------
* pgstat_read_statsfile() -
*
@@ -3595,14 +3685,10 @@ pgstat_write_statsfile(bool permanent)
* ----------
*/
static HTAB *
-pgstat_read_statsfile(Oid onlydb, bool permanent)
+pgstat_read_statsfile(Oid onlydb, bool permanent, bool onlydbs)
{
PgStat_StatDBEntry *dbentry;
PgStat_StatDBEntry dbbuf;
- PgStat_StatTabEntry *tabentry;
- PgStat_StatTabEntry tabbuf;
- PgStat_StatFuncEntry funcbuf;
- PgStat_StatFuncEntry *funcentry;
HASHCTL hash_ctl;
HTAB *dbhash;
HTAB *tabhash = NULL;
@@ -3758,6 +3844,16 @@ pgstat_read_statsfile(Oid onlydb, bool permanent)
*/
tabhash = dbentry->tables;
funchash = dbentry->functions;
+
+ /*
+ * Read the data from the file for this database. If there was
+ * onlydb specified (!= InvalidOid), we would not get here because
+ * of a break above. So we don't need to recheck.
+ */
+ if (! onlydbs)
+ pgstat_read_db_statsfile(dbentry->databaseid, tabhash, funchash,
+ permanent);
+
break;
/*
@@ -3768,6 +3864,79 @@ pgstat_read_statsfile(Oid onlydb, bool permanent)
funchash = NULL;
break;
+ case 'E':
+ goto done;
+
+ default:
+ ereport(pgStatRunningInCollector ? LOG : WARNING,
+ (errmsg("corrupted statistics file \"%s\"",
+ statfile)));
+ goto done;
+ }
+ }
+
+done:
+ FreeFile(fpin);
+
+ if (permanent)
+ unlink(PGSTAT_STAT_PERMANENT_FILENAME);
+
+ return dbhash;
+}
+
+
+/* ----------
+ * pgstat_read_db_statsfile() -
+ *
+ * Reads in an existing statistics collector db file and initializes the
+ * tables and functions hash tables (for the database identified by Oid).
+ * ----------
+ */
+static void
+pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent)
+{
+ PgStat_StatTabEntry *tabentry;
+ PgStat_StatTabEntry tabbuf;
+ PgStat_StatFuncEntry funcbuf;
+ PgStat_StatFuncEntry *funcentry;
+ FILE *fpin;
+ bool found;
+
+ /* FIXME Disgusting. Handle properly ... */
+ const char *statfile_x = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+ char statfile[255];
+
+ /* FIXME Do some kind of reduction (e.g. mod(oid,255)) not to end with thousands of files,
+ *one for each database */
+ snprintf(statfile, 255, "%s.%d", statfile_x, databaseid);
+
+ /*
+ * Try to open the status file. If it doesn't exist, the backends simply
+ * return zero for anything and the collector simply starts from scratch
+ * with empty counters.
+ *
+ * ENOENT is a possibility if the stats collector is not running or has
+ * not yet written the stats file the first time. Any other failure
+ * condition is suspicious.
+ */
+ if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
+ {
+ if (errno != ENOENT)
+ ereport(pgStatRunningInCollector ? LOG : WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not open statistics file \"%s\": %m",
+ statfile)));
+ return;
+ }
+
+ /*
+ * We found an existing collector stats file. Read it and put all the
+ * hashtable entries into place.
+ */
+ for (;;)
+ {
+ switch (fgetc(fpin))
+ {
/*
* 'T' A PgStat_StatTabEntry follows.
*/
@@ -3853,10 +4022,11 @@ pgstat_read_statsfile(Oid onlydb, bool permanent)
done:
FreeFile(fpin);
- if (permanent)
- unlink(PGSTAT_STAT_PERMANENT_FILENAME);
+// FIXME unlink permanent filename (with the proper Oid appended
+// if (permanent)
+// unlink(PGSTAT_STAT_PERMANENT_FILENAME);
- return dbhash;
+ return;
}
/* ----------
@@ -4006,7 +4176,7 @@ backend_read_statsfile(void)
pfree(mytime);
}
- pgstat_send_inquiry(cur_ts, min_ts);
+ pgstat_send_inquiry(cur_ts, min_ts, InvalidOid);
break;
}
@@ -4016,7 +4186,7 @@ backend_read_statsfile(void)
/* Not there or too old, so kick the collector and wait a bit */
if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
- pgstat_send_inquiry(cur_ts, min_ts);
+ pgstat_send_inquiry(cur_ts, min_ts, InvalidOid);
pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
}
@@ -4026,9 +4196,16 @@ backend_read_statsfile(void)
/* Autovacuum launcher wants stats about all databases */
if (IsAutoVacuumLauncherProcess())
- pgStatDBHash = pgstat_read_statsfile(InvalidOid, false);
+ /*
+ * FIXME Does it really need info including tables/functions? Or is it enough to read
+ * database-level stats? It seems to me the launcher needs PgStat_StatDBEntry only
+ * (at least that's how I understand the rebuild_database_list() in autovacuum.c),
+ * because pgstat_stattabentries are used in do_autovacuum() only, that that's what's
+ * executed in workers ... So maybe we'd be just fine by reading in the dbentries?
+ */
+ pgStatDBHash = pgstat_read_statsfile(InvalidOid, false, true);
else
- pgStatDBHash = pgstat_read_statsfile(MyDatabaseId, false);
+ pgStatDBHash = pgstat_read_statsfile(MyDatabaseId, false, false);
}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 613c1c2..8971002 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -205,6 +205,7 @@ typedef struct PgStat_MsgInquiry
PgStat_MsgHdr m_hdr;
TimestampTz clock_time; /* observed local clock time */
TimestampTz cutoff_time; /* minimum acceptable file timestamp */
+ Oid databaseid; /* requested DB (InvalidOid => all DBs) */
} PgStat_MsgInquiry;
On 26.9.2012 18:29, Tom Lane wrote:
What seems to me like it could help more is fixing things so that the
autovac launcher needn't even launch a child process for databases that
haven't had any updates lately. I'm not sure how to do that, but it
probably involves getting the stats collector to produce some kind of
summary file.
Couldn't we use the PgStat_StatDBEntry for this? By splitting the
pgstat.stat file into multiple pieces (see my other post in this thread)
there's a file with StatDBEntry items only, so maybe it could be used as
the summary file ...
I've been thinking about this:
(a) add a "needs_autovacuuming" flag to PgStat_(TableEntry|StatDBEntry)
(b) when table stats are updated, run a quick check to decide whether
the table needs to be processed by autovacuum (vacuumed or
analyzed), and if yes then set needs_autovacuuming=true for both
the table and the database
The worker may read the DB entries from the file and act only on those
that need to be processed (those with needs_autovacuuming=true).
Maybe the DB-level field might be a counter of tables that need to be
processed, and the autovacuum daemon might act on those first? Although
the simpler the better I guess.
Or did you mean something else?
regards
Tomas
On Sun, Nov 18, 2012 at 5:49 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
The two main changes are these:
(1) The stats file is split into a common "db" file, containing all the
DB Entries, and per-database files with tables/functions. The common
file is still called "pgstat.stat", the per-db files have the
database OID appended, so for example "pgstat.stat.12345" etc.

This was a trivial hack of the
pgstat_read_statsfile/pgstat_write_statsfile functions, introducing two
new functions:

pgstat_read_db_statsfile
pgstat_write_db_statsfile

that do the trick of reading/writing the stat file for one database.
(2) The pgstat_read_statsfile has an additional parameter "onlydbs" that
says that you don't need table/func stats - just the list of db
entries. This is used for autovacuum launcher, which does not need
to read the table/stats (if I'm reading the code in autovacuum.c
correctly - it seems to be working as expected).
I'm not an expert on the stats system, but this seems like a promising
approach to me.
(a) It does not solve the "many-schema" scenario at all - that'll need
a completely new approach I guess :-(
We don't need to solve every problem in the first patch. I've got no
problem kicking this one down the road.
(b) It does not solve the writing part at all - the current code uses a
single timestamp (last_statwrite) to decide if a new file needs to
be written.

That clearly is not enough for multiple files - there should be one
timestamp for each database/file. I'm thinking about how to solve
this and how to integrate it with pgstat_send_inquiry etc.
Presumably you need a last_statwrite for each file, in a hash table or
something, and requests need to specify which file is needed.
And yet another one I'm thinking about is using a fixed-length
array of timestamps (e.g. 256), indexed by mod(dboid,256). That
would mean stats for all databases with the same mod(oid,256) would
be written at the same time. Seems like an over-engineering though.
That seems like an unnecessary kludge.
(c) I'm a bit worried about the number of files - right now there's one
for each database and I'm thinking about splitting them by type
(one for tables, one for functions) which might make it even faster
for some apps with a lot of stored procedures etc.

But is the large number of files actually a problem? After all,
we're using one file per relation fork in the "base" directory, so
this seems like a minor issue.
I don't see why one file per database would be a problem. After all,
we already have one directory per database inside base/. If the user
has so many databases that dirent lookups in a directory of that size
are a problem, they're already hosed, and this will probably still
work out to a net win.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 21.11.2012 19:02, Robert Haas wrote:
Attached is a v2 of the patch, fixing some of the issues and unclear
points from the initial version.
The main improvement is that it implements writing only the stats for
the requested database (set when sending an inquiry). There's a dynamic
array of requests - for each DB only the last request is kept.
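The "keep only the last request per database" bookkeeping could be sketched like this (a simplified malloc-based version with stand-in typedefs; the patch keeps this state inside pgstat.c):

```c
#include <stdlib.h>

typedef unsigned int Oid;
typedef long long TimestampTz;

/* Write request info for each database (mirrors the patch's struct) */
typedef struct DBWriteRequest
{
    Oid         databaseid;     /* OID of the database to write */
    TimestampTz request_time;   /* timestamp of the last write request */
} DBWriteRequest;

static DBWriteRequest *last_statrequests = NULL;
static int num_statrequests = 0;

/* Record a write request for one database, overwriting any earlier
 * request for the same OID so only the latest request survives.
 * (Error handling for realloc omitted in this sketch.) */
static void
record_statrequest(Oid databaseid, TimestampTz request_time)
{
    int i;

    for (i = 0; i < num_statrequests; i++)
    {
        if (last_statrequests[i].databaseid == databaseid)
        {
            if (request_time > last_statrequests[i].request_time)
                last_statrequests[i].request_time = request_time;
            return;
        }
    }

    last_statrequests = realloc(last_statrequests,
                                (num_statrequests + 1) * sizeof(DBWriteRequest));
    last_statrequests[num_statrequests].databaseid = databaseid;
    last_statrequests[num_statrequests].request_time = request_time;
    num_statrequests++;
}
```

The collector can then walk this array at write time and emit only the per-db files that were actually requested.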
I've done a number of changes - most importantly:
- added a stats_timestamp field to PgStat_StatDBEntry, keeping the last
write of the database (i.e. a per-database last_statwrite), which is
used to decide whether the file is stale or not
- handling of the 'permanent' flag correctly (used when starting or
stopping the cluster) for per-db files
- added a very simple header to the per-db files (basically just a
format ID and a timestamp) - this is needed for checking of the
timestamp of the last write from workers (although maybe we could
just read the pgstat.stat, which is now rather small)
- a 'force' parameter (true - write all databases, even if they weren't
specifically requested)
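Checking staleness from the per-db header alone might look like this (hypothetical format ID and layout, mirroring the description above: a format ID followed by a timestamp):

```c
#include <stdio.h>

typedef long long TimestampTz;

#define PGSTAT_DB_FORMAT_ID 0x01A5BC9D  /* hypothetical magic number */

/* Read only the header of a per-database stats file and report its
 * write timestamp; returns 0 on success, -1 on open/format failure.
 * A worker can use this to decide whether the per-db file is fresh
 * enough without parsing the table/function entries at all. */
static int
read_db_stats_timestamp(const char *path, TimestampTz *ts)
{
    FILE *fp = fopen(path, "rb");
    int   format_id;

    if (fp == NULL)
        return -1;

    if (fread(&format_id, sizeof(format_id), 1, fp) != 1 ||
        format_id != PGSTAT_DB_FORMAT_ID ||
        fread(ts, sizeof(*ts), 1, fp) != 1)
    {
        fclose(fp);
        return -1;
    }

    fclose(fp);
    return 0;
}
```

As noted above, reading the (now small) common pgstat.stat would work too; the header just avoids touching two files.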
So with the exception of 'multi-schema' case (which was not the aim of
this effort), it should solve all the issues of the initial version.
There are two blocks of code dealing with clock glitches. I haven't
fixed those yet, but that can wait I guess. I've also left there some
logging I've used during development (printing inquiries and which file
is written and when).
The main unsolved problem I'm struggling with is what to do when a
database is dropped. Right now, the statfile remains in pg_stat_tmp
forever (or until the restart) - is there a good way to remove the
file? I'm thinking about adding a message to be sent to the collector
from the code that handles DROP DATABASE.
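A collector-side handler for such a hypothetical "database dropped" message could be as simple as this (sketch only; the message plumbing itself is omitted):

```c
#include <stdio.h>

/* Sketch of a handler for a hypothetical "database dropped" collector
 * message: build the per-db statfile name and unlink it, so
 * pg_stat_tmp does not accumulate files for dropped databases. */
static void
pgstat_drop_db_statsfile(const char *base, unsigned int dboid)
{
    char path[1024];

    snprintf(path, sizeof(path), "%s.%u", base, dboid);
    remove(path);   /* ignore failure: the file may never have existed */
}
```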
I've done some very simple performance testing - I've created 1000
databases with 1000 tables each, and done ANALYZE on all of them. With
only autovacuum running, I've seen this:
Without the patch
-----------------
%CPU %MEM TIME+ COMMAND
18 3.0 0:10.10 postgres: autovacuum launcher process
17 2.6 0:11.44 postgres: stats collector process
The I/O was seriously bogged down, doing ~150 MB/s (basically what the
drive can handle) - with fewer dbs, or when the statfiles are placed on
a tmpfs filesystem, we usually see ~70% of one core doing just this.
With the patch
--------------
The typical "top" for the PostgreSQL processes looked like this:
%CPU %MEM TIME+ COMMAND
2 0.3 1:16.57 postgres: autovacuum launcher process
2 3.1 0:25.34 postgres: stats collector process
and the average write speed from the stats collector was ~3.5MB/s
(measured using iotop), and even when running the ANALYZE etc. I was
getting rather light IO usage (like ~15 MB/s or something).
In both cases, the total size was ~150 MB, but without the patch the
space requirements are actually 2x that (because of writing a copy and
then renaming).
I'd like to put this into 2013-01 commit fest, but if we can do some
prior testing / comments, that'd be great.
regards
Tomas
Attachments:
stats-split-v2.patch (text/plain)
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index be3adf1..63b9e14 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -222,8 +222,16 @@ static PgStat_GlobalStats globalStats;
/* Last time the collector successfully wrote the stats file */
static TimestampTz last_statwrite;
-/* Latest statistics request time from backends */
-static TimestampTz last_statrequest;
+/* Write request info for each database */
+typedef struct DBWriteRequest
+{
+ Oid databaseid; /* OID of the database to write */
+ TimestampTz request_time; /* timestamp of the last write request */
+} DBWriteRequest;
+
+/* Latest statistics request time from backends for each DB */
+static DBWriteRequest * last_statrequests = NULL;
+static int num_statrequests = 0;
static volatile bool need_exit = false;
static volatile bool got_SIGHUP = false;
@@ -252,8 +260,10 @@ static void pgstat_sighup_handler(SIGNAL_ARGS);
static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
Oid tableoid, bool create);
-static void pgstat_write_statsfile(bool permanent);
-static HTAB *pgstat_read_statsfile(Oid onlydb, bool permanent);
+static void pgstat_write_statsfile(bool permanent, bool force);
+static void pgstat_write_db_statsfile(PgStat_StatDBEntry * dbentry, bool permanent);
+static HTAB *pgstat_read_statsfile(Oid onlydb, bool permanent, bool onlydbs);
+static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
static void backend_read_statsfile(void);
static void pgstat_read_current_status(void);
@@ -285,6 +295,8 @@ static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int le
static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
+static bool pgstat_write_statsfile_needed();
+static bool pgstat_db_requested(Oid databaseid);
/* ------------------------------------------------------------
* Public functions called from postmaster follow
@@ -1408,13 +1420,14 @@ pgstat_ping(void)
* ----------
*/
static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time)
+pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
{
PgStat_MsgInquiry msg;
pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
msg.clock_time = clock_time;
msg.cutoff_time = cutoff_time;
+ msg.databaseid = databaseid;
pgstat_send(&msg, sizeof(msg));
}
@@ -3004,6 +3017,7 @@ PgstatCollectorMain(int argc, char *argv[])
int len;
PgStat_Msg msg;
int wr;
+ bool first_write = true;
IsUnderPostmaster = true; /* we are a postmaster subprocess now */
@@ -3055,15 +3069,15 @@ PgstatCollectorMain(int argc, char *argv[])
/*
* Arrange to write the initial status file right away
*/
- last_statrequest = GetCurrentTimestamp();
- last_statwrite = last_statrequest - 1;
-
+ // last_statrequest = GetCurrentTimestamp();
+ // last_statwrite = GetCurrentTimestamp() - 1;
+
/*
* Read in an existing statistics stats file or initialize the stats to
* zero.
*/
pgStatRunningInCollector = true;
- pgStatDBHash = pgstat_read_statsfile(InvalidOid, true);
+ pgStatDBHash = pgstat_read_statsfile(InvalidOid, true, false);
/*
* Loop to process messages until we get SIGQUIT or detect ungraceful
@@ -3109,8 +3123,11 @@ PgstatCollectorMain(int argc, char *argv[])
* Write the stats file if a new request has arrived that is not
* satisfied by existing file.
*/
- if (last_statwrite < last_statrequest)
- pgstat_write_statsfile(false);
+ if (first_write || pgstat_write_statsfile_needed())
+ {
+ pgstat_write_statsfile(false, first_write);
+ first_write = false;
+ }
/*
* Try to receive and process a message. This will not block,
@@ -3269,7 +3286,7 @@ PgstatCollectorMain(int argc, char *argv[])
/*
* Save the final stats to reuse at next startup.
*/
- pgstat_write_statsfile(true);
+ pgstat_write_statsfile(true, true);
exit(0);
}
@@ -3432,20 +3449,18 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
* ----------
*/
static void
-pgstat_write_statsfile(bool permanent)
+pgstat_write_statsfile(bool permanent, bool force)
{
HASH_SEQ_STATUS hstat;
- HASH_SEQ_STATUS tstat;
- HASH_SEQ_STATUS fstat;
PgStat_StatDBEntry *dbentry;
- PgStat_StatTabEntry *tabentry;
- PgStat_StatFuncEntry *funcentry;
FILE *fpout;
int32 format_id;
const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
int rc;
+ elog(WARNING, "writing statsfile '%s'", statfile);
+
/*
* Open the statistics temp file to write out the current values.
*/
@@ -3489,36 +3504,36 @@ pgstat_write_statsfile(bool permanent)
* use to any other process.
*/
fputc('D', fpout);
+ dbentry->stats_timestamp = globalStats.stats_timestamp;
rc = fwrite(dbentry, offsetof(PgStat_StatDBEntry, tables), 1, fpout);
(void) rc; /* we'll check for error with ferror */
/*
- * Walk through the database's access stats per table.
+ * Write out the tables and functions into a separate file, but only
+ * if the database is in the requests or if it's a forced write (then
+ * all the DBs need to be written - e.g. at shutdown).
*/
- hash_seq_init(&tstat, dbentry->tables);
- while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
- {
- fputc('T', fpout);
- rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
- (void) rc; /* we'll check for error with ferror */
- }
-
- /*
- * Walk through the database's function stats table.
- */
- hash_seq_init(&fstat, dbentry->functions);
- while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
- {
- fputc('F', fpout);
- rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
- (void) rc; /* we'll check for error with ferror */
+ if (force || pgstat_db_requested(dbentry->databaseid)) {
+ elog(WARNING, "writing statsfile for DB %d", dbentry->databaseid);
+ pgstat_write_db_statsfile(dbentry, permanent);
}
/*
* Mark the end of this DB
+ *
+ * FIXME does it really make much sense, when the tables/functions
+ * are moved to a separate file (using those chars?)
*/
fputc('d', fpout);
}
+
+ /* In any case, we can just throw away all the db requests. */
+ if (last_statrequests != NULL)
+ {
+ pfree(last_statrequests);
+ last_statrequests = NULL;
+ num_statrequests = 0;
+ }
/*
* No more output to be done. Close the temp file and replace the old
@@ -3559,27 +3574,28 @@ pgstat_write_statsfile(bool permanent)
*/
last_statwrite = globalStats.stats_timestamp;
+ /* FIXME Update to the per-db request times. */
/*
* If there is clock skew between backends and the collector, we could
* receive a stats request time that's in the future. If so, complain
* and reset last_statrequest. Resetting ensures that no inquiry
* message can cause more than one stats file write to occur.
*/
- if (last_statrequest > last_statwrite)
- {
- char *reqtime;
- char *mytime;
-
- /* Copy because timestamptz_to_str returns a static buffer */
- reqtime = pstrdup(timestamptz_to_str(last_statrequest));
- mytime = pstrdup(timestamptz_to_str(last_statwrite));
- elog(LOG, "last_statrequest %s is later than collector's time %s",
- reqtime, mytime);
- pfree(reqtime);
- pfree(mytime);
-
- last_statrequest = last_statwrite;
- }
+// if (last_statrequest > last_statwrite)
+// {
+// char *reqtime;
+// char *mytime;
+//
+// /* Copy because timestamptz_to_str returns a static buffer */
+// reqtime = pstrdup(timestamptz_to_str(last_statrequest));
+// mytime = pstrdup(timestamptz_to_str(last_statwrite));
+// elog(LOG, "last_statrequest %s is later than collector's time %s",
+// reqtime, mytime);
+// pfree(reqtime);
+// pfree(mytime);
+//
+// last_statrequest = last_statwrite;
+// }
}
if (permanent)
@@ -3587,6 +3603,137 @@ pgstat_write_statsfile(bool permanent)
}
+
+/* ----------
+ * pgstat_write_db_statsfile() -
+ *
+ * Tell the news.
+ * If writing to the permanent file (happens when the collector is
+ * shutting down only), remove the temporary file so that backends
+ * starting up under a new postmaster can't read the old data before
+ * the new collector is ready.
+ * ----------
+ */
+static void
+pgstat_write_db_statsfile(PgStat_StatDBEntry * dbentry, bool permanent)
+{
+ HASH_SEQ_STATUS tstat;
+ HASH_SEQ_STATUS fstat;
+ PgStat_StatTabEntry *tabentry;
+ PgStat_StatFuncEntry *funcentry;
+ FILE *fpout;
+ int32 format_id;
+ const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
+ const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+ int rc;
+
+ /*
+ * OIDs are 32-bit values, so 10 chars should be safe, +2 for the dot and \0 byte
+ */
+ char db_tmpfile[strlen(tmpfile) + 12];
+ char db_statfile[strlen(statfile) + 12];
+
+ /*
+ * Append database OID at the end of the basic filename (both for tmp and target file).
+ */
+ snprintf(db_tmpfile, strlen(tmpfile) + 12, "%s.%d", tmpfile, dbentry->databaseid);
+ snprintf(db_statfile, strlen(statfile) + 12, "%s.%d", statfile, dbentry->databaseid);
+
+ elog(WARNING, "writing statsfile '%s'", db_statfile);
+
+ /*
+ * Open the statistics temp file to write out the current values.
+ */
+ fpout = AllocateFile(db_tmpfile, PG_BINARY_W);
+ if (fpout == NULL)
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not open temporary statistics file \"%s\": %m",
+ db_tmpfile)));
+ return;
+ }
+
+ /*
+ * Write the file header --- currently just a format ID.
+ */
+ format_id = PGSTAT_FILE_FORMAT_ID;
+ rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
+ (void) rc; /* we'll check for error with ferror */
+
+ /*
+ * Write the timestamp.
+ */
+ rc = fwrite(&(globalStats.stats_timestamp), sizeof(globalStats.stats_timestamp), 1, fpout);
+ (void) rc; /* we'll check for error with ferror */
+
+ /*
+ * Walk through the database's access stats per table.
+ */
+ hash_seq_init(&tstat, dbentry->tables);
+ while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
+ {
+ fputc('T', fpout);
+ rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
+ (void) rc; /* we'll check for error with ferror */
+ }
+
+ /*
+ * Walk through the database's function stats table.
+ */
+ hash_seq_init(&fstat, dbentry->functions);
+ while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
+ {
+ fputc('F', fpout);
+ rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
+ (void) rc; /* we'll check for error with ferror */
+ }
+
+ /*
+ * No more output to be done. Close the temp file and replace the old
+ * pgstat.stat with it. The ferror() check replaces testing for error
+ * after each individual fputc or fwrite above.
+ */
+ fputc('E', fpout);
+
+ if (ferror(fpout))
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not write temporary statistics file \"%s\": %m",
+ db_tmpfile)));
+ FreeFile(fpout);
+ unlink(db_tmpfile);
+ }
+ else if (FreeFile(fpout) < 0)
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not close temporary statistics file \"%s\": %m",
+ db_tmpfile)));
+ unlink(db_tmpfile);
+ }
+ else if (rename(db_tmpfile, db_statfile) < 0)
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
+ db_tmpfile, db_statfile)));
+ unlink(db_tmpfile);
+ }
+
+ if (permanent)
+ {
+ /* FIXME This aliases the existing db_statfile variable (might have different
+ * length). */
+ char db_statfile[strlen(pgstat_stat_filename) + 12];
+ snprintf(db_statfile, strlen(pgstat_stat_filename) + 12, "%s.%d",
+ pgstat_stat_filename, dbentry->databaseid);
+ elog(DEBUG1, "removing stat file '%s'", db_statfile);
+ unlink(db_statfile);
+ }
+}
+
/* ----------
* pgstat_read_statsfile() -
*
@@ -3595,14 +3742,10 @@ pgstat_write_statsfile(bool permanent)
* ----------
*/
static HTAB *
-pgstat_read_statsfile(Oid onlydb, bool permanent)
+pgstat_read_statsfile(Oid onlydb, bool permanent, bool onlydbs)
{
PgStat_StatDBEntry *dbentry;
PgStat_StatDBEntry dbbuf;
- PgStat_StatTabEntry *tabentry;
- PgStat_StatTabEntry tabbuf;
- PgStat_StatFuncEntry funcbuf;
- PgStat_StatFuncEntry *funcentry;
HASHCTL hash_ctl;
HTAB *dbhash;
HTAB *tabhash = NULL;
@@ -3758,6 +3901,16 @@ pgstat_read_statsfile(Oid onlydb, bool permanent)
*/
tabhash = dbentry->tables;
funchash = dbentry->functions;
+
+ /*
+ * Read the data from the file for this database. If there was
+ * onlydb specified (!= InvalidOid), we would not get here because
+ * of a break above. So we don't need to recheck.
+ */
+ if (! onlydbs)
+ pgstat_read_db_statsfile(dbentry->databaseid, tabhash, funchash,
+ permanent);
+
break;
/*
@@ -3768,6 +3921,105 @@ pgstat_read_statsfile(Oid onlydb, bool permanent)
funchash = NULL;
break;
+ case 'E':
+ goto done;
+
+ default:
+ ereport(pgStatRunningInCollector ? LOG : WARNING,
+ (errmsg("corrupted statistics file \"%s\"",
+ statfile)));
+ goto done;
+ }
+ }
+
+done:
+ FreeFile(fpin);
+
+ if (permanent)
+ unlink(PGSTAT_STAT_PERMANENT_FILENAME);
+
+ return dbhash;
+}
+
+
+/* ----------
+ * pgstat_read_db_statsfile() -
+ *
+ * Reads in an existing statistics collector db file and initializes the
+ * tables and functions hash tables (for the database identified by Oid).
+ * ----------
+ */
+static void
+pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent)
+{
+ PgStat_StatTabEntry *tabentry;
+ PgStat_StatTabEntry tabbuf;
+ PgStat_StatFuncEntry funcbuf;
+ PgStat_StatFuncEntry *funcentry;
+ FILE *fpin;
+ int32 format_id;
+ TimestampTz timestamp;
+ bool found;
+ const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+
+ /*
+ * OIDs are 32-bit values, so 10 chars should be safe, +2 for the dot and \0 byte
+ */
+ char db_statfile[strlen(statfile) + 12];
+
+ /*
+ * Append database OID at the end of the basic filename (both for tmp and target file).
+ */
+ snprintf(db_statfile, strlen(statfile) + 12, "%s.%d", statfile, databaseid);
+
+ /*
+ * Try to open the status file. If it doesn't exist, the backends simply
+ * return zero for anything and the collector simply starts from scratch
+ * with empty counters.
+ *
+ * ENOENT is a possibility if the stats collector is not running or has
+ * not yet written the stats file the first time. Any other failure
+ * condition is suspicious.
+ */
+ if ((fpin = AllocateFile(db_statfile, PG_BINARY_R)) == NULL)
+ {
+ if (errno != ENOENT)
+ ereport(pgStatRunningInCollector ? LOG : WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not open statistics file \"%s\": %m",
+ db_statfile)));
+ return;
+ }
+
+ /*
+ * Verify it's of the expected format.
+ */
+ if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id)
+ || format_id != PGSTAT_FILE_FORMAT_ID)
+ {
+ ereport(pgStatRunningInCollector ? LOG : WARNING,
+ (errmsg("corrupted statistics file \"%s\"", db_statfile)));
+ goto done;
+ }
+
+ /*
+ * Read global stats struct
+ */
+ if (fread(&timestamp, 1, sizeof(timestamp), fpin) != sizeof(timestamp))
+ {
+ ereport(pgStatRunningInCollector ? LOG : WARNING,
+ (errmsg("corrupted statistics file \"%s\"", db_statfile)));
+ goto done;
+ }
+
+ /*
+ * We found an existing collector stats file. Read it and put all the
+ * hashtable entries into place.
+ */
+ for (;;)
+ {
+ switch (fgetc(fpin))
+ {
/*
* 'T' A PgStat_StatTabEntry follows.
*/
@@ -3777,7 +4029,7 @@ pgstat_read_statsfile(Oid onlydb, bool permanent)
{
ereport(pgStatRunningInCollector ? LOG : WARNING,
(errmsg("corrupted statistics file \"%s\"",
- statfile)));
+ db_statfile)));
goto done;
}
@@ -3795,7 +4047,7 @@ pgstat_read_statsfile(Oid onlydb, bool permanent)
{
ereport(pgStatRunningInCollector ? LOG : WARNING,
(errmsg("corrupted statistics file \"%s\"",
- statfile)));
+ db_statfile)));
goto done;
}
@@ -3811,7 +4063,7 @@ pgstat_read_statsfile(Oid onlydb, bool permanent)
{
ereport(pgStatRunningInCollector ? LOG : WARNING,
(errmsg("corrupted statistics file \"%s\"",
- statfile)));
+ db_statfile)));
goto done;
}
@@ -3829,7 +4081,7 @@ pgstat_read_statsfile(Oid onlydb, bool permanent)
{
ereport(pgStatRunningInCollector ? LOG : WARNING,
(errmsg("corrupted statistics file \"%s\"",
- statfile)));
+ db_statfile)));
goto done;
}
@@ -3845,7 +4097,7 @@ pgstat_read_statsfile(Oid onlydb, bool permanent)
default:
ereport(pgStatRunningInCollector ? LOG : WARNING,
(errmsg("corrupted statistics file \"%s\"",
- statfile)));
+ db_statfile)));
goto done;
}
}
@@ -3854,37 +4106,49 @@ done:
FreeFile(fpin);
if (permanent)
- unlink(PGSTAT_STAT_PERMANENT_FILENAME);
+ {
+ /* FIXME This aliases the existing db_statfile variable (might have different
+ * length). */
+ char db_statfile[strlen(PGSTAT_STAT_PERMANENT_FILENAME) + 12];
+ snprintf(db_statfile, strlen(PGSTAT_STAT_PERMANENT_FILENAME) + 12, "%s.%d",
+ PGSTAT_STAT_PERMANENT_FILENAME, databaseid);
+ elog(DEBUG1, "removing permanent stats file '%s'", db_statfile);
+ unlink(db_statfile);
+ }
- return dbhash;
+ return;
}
/* ----------
- * pgstat_read_statsfile_timestamp() -
+ * pgstat_read_db_statsfile_timestamp() -
*
- * Attempt to fetch the timestamp of an existing stats file.
+ * Attempt to fetch the timestamp of an existing stats file (for a DB).
* Returns TRUE if successful (timestamp is stored at *ts).
* ----------
*/
static bool
-pgstat_read_statsfile_timestamp(bool permanent, TimestampTz *ts)
+pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent, TimestampTz *ts)
{
- PgStat_GlobalStats myGlobalStats;
+ TimestampTz timestamp;
FILE *fpin;
int32 format_id;
const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+ char db_statfile[strlen(statfile) + 12];
+
+ /* format the db statfile filename */
+ snprintf(db_statfile, strlen(statfile) + 12, "%s.%d", statfile, databaseid);
/*
* Try to open the status file. As above, anything but ENOENT is worthy
* of complaining about.
*/
- if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
+ if ((fpin = AllocateFile(db_statfile, PG_BINARY_R)) == NULL)
{
if (errno != ENOENT)
ereport(pgStatRunningInCollector ? LOG : WARNING,
(errcode_for_file_access(),
errmsg("could not open statistics file \"%s\": %m",
- statfile)));
+ db_statfile)));
return false;
}
@@ -3895,7 +4159,7 @@ pgstat_read_statsfile_timestamp(bool permanent, TimestampTz *ts)
|| format_id != PGSTAT_FILE_FORMAT_ID)
{
ereport(pgStatRunningInCollector ? LOG : WARNING,
- (errmsg("corrupted statistics file \"%s\"", statfile)));
+ (errmsg("corrupted statistics file \"%s\"", db_statfile)));
FreeFile(fpin);
return false;
}
@@ -3903,15 +4167,15 @@ pgstat_read_statsfile_timestamp(bool permanent, TimestampTz *ts)
/*
* Read global stats struct
*/
- if (fread(&myGlobalStats, 1, sizeof(myGlobalStats), fpin) != sizeof(myGlobalStats))
+ if (fread(&timestamp, 1, sizeof(TimestampTz), fpin) != sizeof(TimestampTz))
{
ereport(pgStatRunningInCollector ? LOG : WARNING,
- (errmsg("corrupted statistics file \"%s\"", statfile)));
+ (errmsg("corrupted statistics file \"%s\"", db_statfile)));
FreeFile(fpin);
return false;
}
- *ts = myGlobalStats.stats_timestamp;
+ *ts = timestamp;
FreeFile(fpin);
return true;
@@ -3947,7 +4211,7 @@ backend_read_statsfile(void)
CHECK_FOR_INTERRUPTS();
- ok = pgstat_read_statsfile_timestamp(false, &file_ts);
+ ok = pgstat_read_db_statsfile_timestamp(MyDatabaseId, false, &file_ts);
cur_ts = GetCurrentTimestamp();
/* Calculate min acceptable timestamp, if we didn't already */
@@ -4006,7 +4270,7 @@ backend_read_statsfile(void)
pfree(mytime);
}
- pgstat_send_inquiry(cur_ts, min_ts);
+ pgstat_send_inquiry(cur_ts, min_ts, MyDatabaseId);
break;
}
@@ -4016,7 +4280,7 @@ backend_read_statsfile(void)
/* Not there or too old, so kick the collector and wait a bit */
if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
- pgstat_send_inquiry(cur_ts, min_ts);
+ pgstat_send_inquiry(cur_ts, min_ts, MyDatabaseId);
pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
}
@@ -4026,9 +4290,16 @@ backend_read_statsfile(void)
/* Autovacuum launcher wants stats about all databases */
if (IsAutoVacuumLauncherProcess())
- pgStatDBHash = pgstat_read_statsfile(InvalidOid, false);
+ /*
+ * FIXME Does it really need info including tables/functions? Or is it enough to read
+ * database-level stats? It seems to me the launcher needs PgStat_StatDBEntry only
+ * (at least that's how I understand the rebuild_database_list() in autovacuum.c),
+ * because pgstat_stattabentries are used in do_autovacuum() only, that that's what's
+ * executed in workers ... So maybe we'd be just fine by reading in the dbentries?
+ */
+ pgStatDBHash = pgstat_read_statsfile(InvalidOid, false, true);
else
- pgStatDBHash = pgstat_read_statsfile(MyDatabaseId, false);
+ pgStatDBHash = pgstat_read_statsfile(MyDatabaseId, false, false);
}
@@ -4084,13 +4355,53 @@ pgstat_clear_snapshot(void)
static void
pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
{
+ int i = 0;
+ bool found = false;
+
+ elog(WARNING, "received inquiry for %d", msg->databaseid);
+
/*
- * Advance last_statrequest if this requestor has a newer cutoff time
- * than any previous request.
+ * Find the last write request for this DB (found=true in that case). Plain
+ * linear search, not really worth doing any magic here (probably).
*/
- if (msg->cutoff_time > last_statrequest)
- last_statrequest = msg->cutoff_time;
+ for (i = 0; i < num_statrequests; i++)
+ {
+ if (last_statrequests[i].databaseid == msg->databaseid)
+ {
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ /*
+ * There already is a request for this DB, so lets advance the
+ * request time if this requestor has a newer cutoff time
+ * than any previous request.
+ */
+ if (msg->cutoff_time > last_statrequests[i].request_time)
+ last_statrequests[i].request_time = msg->cutoff_time;
+ }
+ else
+ {
+ /*
+ * There's no request for this DB yet, so lets create it (allocate a
+ * space for it, set the values).
+ */
+ if (last_statrequests == NULL)
+ last_statrequests = palloc(sizeof(DBWriteRequest));
+ else
+ last_statrequests = repalloc(last_statrequests,
+ (num_statrequests + 1)*sizeof(DBWriteRequest));
+
+ last_statrequests[num_statrequests].databaseid = msg->databaseid;
+ last_statrequests[num_statrequests].request_time = msg->clock_time;
+ num_statrequests += 1;
+ }
+ /* FIXME Do we need to update this to work with per-db stats? This should
+ * be moved to the "else" branch I guess. */
/*
* If the requestor's local clock time is older than last_statwrite, we
* should suspect a clock glitch, ie system time going backwards; though
@@ -4099,31 +4410,31 @@ pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
* retreat in the system clock reading could otherwise cause us to neglect
* to update the stats file for a long time.
*/
- if (msg->clock_time < last_statwrite)
- {
- TimestampTz cur_ts = GetCurrentTimestamp();
-
- if (cur_ts < last_statwrite)
- {
- /*
- * Sure enough, time went backwards. Force a new stats file write
- * to get back in sync; but first, log a complaint.
- */
- char *writetime;
- char *mytime;
-
- /* Copy because timestamptz_to_str returns a static buffer */
- writetime = pstrdup(timestamptz_to_str(last_statwrite));
- mytime = pstrdup(timestamptz_to_str(cur_ts));
- elog(LOG, "last_statwrite %s is later than collector's time %s",
- writetime, mytime);
- pfree(writetime);
- pfree(mytime);
-
- last_statrequest = cur_ts;
- last_statwrite = last_statrequest - 1;
- }
- }
+// if (msg->clock_time < last_statwrite)
+// {
+// TimestampTz cur_ts = GetCurrentTimestamp();
+//
+// if (cur_ts < last_statwrite)
+// {
+// /*
+// * Sure enough, time went backwards. Force a new stats file write
+// * to get back in sync; but first, log a complaint.
+// */
+// char *writetime;
+// char *mytime;
+//
+// /* Copy because timestamptz_to_str returns a static buffer */
+// writetime = pstrdup(timestamptz_to_str(last_statwrite));
+// mytime = pstrdup(timestamptz_to_str(cur_ts));
+// elog(LOG, "last_statwrite %s is later than collector's time %s",
+// writetime, mytime);
+// pfree(writetime);
+// pfree(mytime);
+//
+// last_statrequest = cur_ts;
+// last_statwrite = last_statrequest - 1;
+// }
+// }
}
@@ -4687,3 +4998,54 @@ pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
HASH_REMOVE, NULL);
}
}
+
+/* ----------
+ * pgstat_write_statsfile_needed() -
+ *
+ * Checks whether there's a db stats request, requiring a file write.
+ * ----------
+ */
+
+static bool pgstat_write_statsfile_needed()
+{
+ int i = 0;
+ PgStat_StatDBEntry *dbentry;
+
+ /* Check the databases if they need to refresh the stats. */
+ for (i = 0; i < num_statrequests; i++)
+ {
+ dbentry = pgstat_get_db_entry(last_statrequests[i].databaseid, false);
+
+ /* No dbentry yet or too old. */
+ if ((! dbentry) ||
+ (dbentry->stats_timestamp < last_statrequests[i].request_time)) {
+ return true;
+ }
+
+ }
+
+ /* Well, everything was written recently ... */
+ return false;
+}
+
+/* ----------
+ * pgstat_db_requested() -
+ *
+ * Checks whether stats for a particular DB need to be written to a file.
+ * ----------
+ */
+
+static bool
+pgstat_db_requested(Oid databaseid)
+{
+ int i = 0;
+
+ /* Check the databases if they need to refresh the stats. */
+ for (i = 0; i < num_statrequests; i++)
+ {
+ if (last_statrequests[i].databaseid == databaseid)
+ return true;
+ }
+
+ return false;
+}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 613c1c2..bdb1bbc 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -205,6 +205,7 @@ typedef struct PgStat_MsgInquiry
PgStat_MsgHdr m_hdr;
TimestampTz clock_time; /* observed local clock time */
TimestampTz cutoff_time; /* minimum acceptable file timestamp */
+ Oid databaseid; /* requested DB (InvalidOid => all DBs) */
} PgStat_MsgInquiry;
@@ -545,6 +546,7 @@ typedef struct PgStat_StatDBEntry
PgStat_Counter n_block_write_time;
TimestampTz stat_reset_timestamp;
+ TimestampTz stats_timestamp; /* time of db stats file update */
/*
* tables and functions must be last in the struct, because we don't write
Hi,
attached is a version of the patch that I believe is ready for the
commitfest. As the patch was discussed over a large number of messages,
I've prepared a brief summary for those who did not follow the thread.
Issue
=====
The patch aims to improve the situation in deployments with many tables
in many databases (think for example 1000 tables in 1000 databases).
Currently all the stats for all the objects (dbs, tables and functions)
are written in a single file (pgstat.stat), which may get quite large
and that consequently leads to various issues:
1) I/O load - the file is frequently written / read, which may use a
significant part of the I/O bandwidth. For example, we've had to deal with
cases where the pgstat.stat size is >150 MB and it's written (and read)
continuously (once it's written, a new write starts) and utilizes 100%
bandwidth on that device.
2) CPU load - a common solution to the previous issue is moving the file
into RAM, using a tmpfs filesystem. That "fixes" the I/O bottleneck but
causes high CPU load because the system is serializing and deserializing
large amounts of data. We often see ~1 CPU core "lost" due to this (and
causing higher power consumption, but that's Amazon's problem ;-))
3) disk space utilization - the pgstat.stat file is updated in two
steps, i.e. a new version is written to another file (pgstat.tmp) and
then it's renamed to pgstat.stat, which means the device (amount of RAM,
if using tmpfs device) needs to be >2x the actual size of the file.
(Actually more, because there may be descriptors open to multiple
versions of the file.)
This patch does not attempt to fix a "single DB with multiple schemas"
scenario, although it should not have a negative impact on it.
What the patch does
===================
1) split into global and per-db files
-------------------------------------
The patch "splits" the huge pgstat.stat file into smaller pieces - one
"global" one (global.stat) with database stats, and one file for each of
the databases (oid.stat) with table and function stats.
This makes it possible to write/read much smaller amounts of data, because
a) autovacuum launcher does not need to read the whole file - it needs
just the list of databases (and not the table/func stats)
b) autovacuum workers do request a fresh copy of a single database, so
the stats collector may write just the global.stat + one of the per-db files
and that consequently leads to much lower I/O and CPU load. During our
tests we've seen the I/O to drop from ~150MB/s to less than 4MB/s, and
much lower CPU utilization.
2) a new global/stat directory
------------------------------
The pgstat.stat file was originally saved into the "global" directory,
but with so many files that would get rather messy so I've created a new
global/stat directory and all the files are stored there.
This also means we can do a simple "delete files in the dir" when
pgstat_reset_all is called.
3) pgstat_(read|write)_statsfile split
--------------------------------------
These two functions were each split into a global and a per-db variant,
so now there are:
pgstat_write_statsfile -- global.stat
pgstat_write_db_statsfile -- oid.stat
pgstat_read_statsfile -- global.stat
pgstat_read_db_statsfile -- oid.stat
There's a logic to read/write only those files that are actually needed.
4) list of (OID, timestamp) inquiries, last db-write
----------------------------------------------------
Originally there was a single pair of request/write timestamps for the
whole file, updated whenever a worker requested a fresh file or when the
file was written.
With the split, this had to be replaced by two lists - a timestamp of
the last inquiry (per DB), and a timestamp when each database file was
written for the last time.
The timestamp of the last DB write was added to the PgStat_StatDBEntry
and the list of inquiries is kept in last_statrequests. The fields are
used at several places, so it's probably best to see the code.
Handling the timestamps is rather complex because of possible clock
skew. One of those checks is not needed, as the list of inquiries is
freed right after writing all the databases. But I wouldn't be surprised
if there was something I missed, as splitting the file into multiple
pieces made this part more complex.
So please, if you're going to review this patch this is one of the
tricky places.
5) dummy file
-------------
A special handling is necessary when an inquiry arrives for a database
without a PgStat_StatDBEntry - this happens for example right after
initdb, when there are no stats for template0 and template1, yet the
autovacuum workers do send inqiries for them.
backend_read_statsfile now uses the timestamp stored in the header
of the per-db file (not in the global one), and the easiest way to handle
this for new databases is writing an empty 'dummy file' (just a header
with a timestamp). Without this, we'd get 'pgstat wait
timeout' errors.
This is what pgstat_write_db_dummyfile (used in pgstat_write_statsfile)
is for.
6) format ID
------------
I've bumped PGSTAT_FILE_FORMAT_ID to a new random value, although the
filenames changed too, so we could have lived with the old ID just fine.
We've done a fair amount of testing so far, and if everything goes fine
we plan to deploy a back-ported version of this patch (to 9.1) on a
production in ~2 weeks.
Then I'll be able to provide some numbers from a real-world workload
(although our deployment and workload are not quite usual, I guess).
regards
Attachments:
stats-split-v4.patch (text/plain; charset=UTF-8)
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index be3adf1..37b85e6 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -64,10 +64,14 @@
/* ----------
* Paths for the statistics files (relative to installation's $PGDATA).
+ * Permanent and temporary, global and per-database files.
* ----------
*/
-#define PGSTAT_STAT_PERMANENT_FILENAME "global/pgstat.stat"
-#define PGSTAT_STAT_PERMANENT_TMPFILE "global/pgstat.tmp"
+#define PGSTAT_STAT_PERMANENT_DIRECTORY "global/stat"
+#define PGSTAT_STAT_PERMANENT_FILENAME "global/stat/global.stat"
+#define PGSTAT_STAT_PERMANENT_TMPFILE "global/stat/global.tmp"
+#define PGSTAT_STAT_PERMANENT_DB_FILENAME "global/stat/%d.stat"
+#define PGSTAT_STAT_PERMANENT_DB_TMPFILE "global/stat/%d.tmp"
/* ----------
* Timer definitions.
@@ -115,8 +119,11 @@ int pgstat_track_activity_query_size = 1024;
* Built from GUC parameter
* ----------
*/
+char *pgstat_stat_directory = NULL;
char *pgstat_stat_filename = NULL;
char *pgstat_stat_tmpname = NULL;
+char *pgstat_stat_db_filename = NULL;
+char *pgstat_stat_db_tmpname = NULL;
/*
* BgWriter global statistics counters (unused in other processes).
@@ -219,11 +226,16 @@ static int localNumBackends = 0;
*/
static PgStat_GlobalStats globalStats;
-/* Last time the collector successfully wrote the stats file */
-static TimestampTz last_statwrite;
+/* Write request info for each database */
+typedef struct DBWriteRequest
+{
+ Oid databaseid; /* OID of the database to write */
+ TimestampTz request_time; /* timestamp of the last write request */
+} DBWriteRequest;
-/* Latest statistics request time from backends */
-static TimestampTz last_statrequest;
+/* Latest statistics request time from backends for each DB */
+static DBWriteRequest * last_statrequests = NULL;
+static int num_statrequests = 0;
static volatile bool need_exit = false;
static volatile bool got_SIGHUP = false;
@@ -252,11 +264,17 @@ static void pgstat_sighup_handler(SIGNAL_ARGS);
static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
Oid tableoid, bool create);
-static void pgstat_write_statsfile(bool permanent);
-static HTAB *pgstat_read_statsfile(Oid onlydb, bool permanent);
+static void pgstat_write_statsfile(bool permanent, bool force);
+static void pgstat_write_db_statsfile(PgStat_StatDBEntry * dbentry, bool permanent);
+static void pgstat_write_db_dummyfile(Oid databaseid);
+static HTAB *pgstat_read_statsfile(Oid onlydb, bool permanent, bool onlydbs);
+static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
static void backend_read_statsfile(void);
static void pgstat_read_current_status(void);
+static bool pgstat_write_statsfile_needed(void);
+static bool pgstat_db_requested(Oid databaseid);
+
static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
static void pgstat_send_funcstats(void);
static HTAB *pgstat_collect_oids(Oid catalogid);
@@ -285,7 +303,6 @@ static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int le
static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
-
/* ------------------------------------------------------------
* Public functions called from postmaster follow
* ------------------------------------------------------------
@@ -549,8 +566,34 @@ startup_failed:
void
pgstat_reset_all(void)
{
- unlink(pgstat_stat_filename);
- unlink(PGSTAT_STAT_PERMANENT_FILENAME);
+ DIR * dir;
+ struct dirent * entry;
+
+ dir = AllocateDir(pgstat_stat_directory);
+ while ((entry = ReadDir(dir, pgstat_stat_directory)) != NULL)
+ {
+ char fname[strlen(pgstat_stat_directory) + strlen(entry->d_name) + 2];
+
+ if (strcmp(entry->d_name, ".") == 0 || strcmp(entry->d_name, "..") == 0)
+ continue;
+
+ sprintf(fname, "%s/%s", pgstat_stat_directory, entry->d_name);
+ unlink(fname);
+ }
+ FreeDir(dir);
+
+ dir = AllocateDir(PGSTAT_STAT_PERMANENT_DIRECTORY);
+ while ((entry = ReadDir(dir, PGSTAT_STAT_PERMANENT_DIRECTORY)) != NULL)
+ {
+ char fname[strlen(PGSTAT_STAT_PERMANENT_DIRECTORY) + strlen(entry->d_name) + 2];
+
+ if (strcmp(entry->d_name, ".") == 0 || strcmp(entry->d_name, "..") == 0)
+ continue;
+
+ sprintf(fname, "%s/%s", PGSTAT_STAT_PERMANENT_DIRECTORY, entry->d_name);
+ unlink(fname);
+ }
+ FreeDir(dir);
}
#ifdef EXEC_BACKEND
@@ -1408,13 +1451,14 @@ pgstat_ping(void)
* ----------
*/
static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time)
+pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
{
PgStat_MsgInquiry msg;
pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
msg.clock_time = clock_time;
msg.cutoff_time = cutoff_time;
+ msg.databaseid = databaseid;
pgstat_send(&msg, sizeof(msg));
}
@@ -3004,6 +3048,7 @@ PgstatCollectorMain(int argc, char *argv[])
int len;
PgStat_Msg msg;
int wr;
+ bool first_write = true;
IsUnderPostmaster = true; /* we are a postmaster subprocess now */
@@ -3053,17 +3098,11 @@ PgstatCollectorMain(int argc, char *argv[])
init_ps_display("stats collector process", "", "", "");
/*
- * Arrange to write the initial status file right away
- */
- last_statrequest = GetCurrentTimestamp();
- last_statwrite = last_statrequest - 1;
-
- /*
* Read in an existing statistics stats file or initialize the stats to
- * zero.
+ * zero (read data for all databases, including table/func stats).
*/
pgStatRunningInCollector = true;
- pgStatDBHash = pgstat_read_statsfile(InvalidOid, true);
+ pgStatDBHash = pgstat_read_statsfile(InvalidOid, true, false);
/*
* Loop to process messages until we get SIGQUIT or detect ungraceful
@@ -3107,10 +3146,14 @@ PgstatCollectorMain(int argc, char *argv[])
/*
* Write the stats file if a new request has arrived that is not
- * satisfied by existing file.
+ * satisfied by existing file (force writing all files if it's
+ * the first write after startup).
*/
- if (last_statwrite < last_statrequest)
- pgstat_write_statsfile(false);
+ if (first_write || pgstat_write_statsfile_needed())
+ {
+ pgstat_write_statsfile(false, first_write);
+ first_write = false;
+ }
/*
* Try to receive and process a message. This will not block,
@@ -3269,7 +3312,7 @@ PgstatCollectorMain(int argc, char *argv[])
/*
* Save the final stats to reuse at next startup.
*/
- pgstat_write_statsfile(true);
+ pgstat_write_statsfile(true, true);
exit(0);
}
@@ -3429,23 +3472,25 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
* shutting down only), remove the temporary file so that backends
* starting up under a new postmaster can't read the old data before
* the new collector is ready.
+ *
+ * When 'force' is false, only the requested databases (listed in
+ * last_statrequests) will be written. If 'force' is true, all databases
+ * will be written (this is used e.g. at shutdown).
* ----------
*/
static void
-pgstat_write_statsfile(bool permanent)
+pgstat_write_statsfile(bool permanent, bool force)
{
HASH_SEQ_STATUS hstat;
- HASH_SEQ_STATUS tstat;
- HASH_SEQ_STATUS fstat;
PgStat_StatDBEntry *dbentry;
- PgStat_StatTabEntry *tabentry;
- PgStat_StatFuncEntry *funcentry;
FILE *fpout;
int32 format_id;
const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
int rc;
+ elog(DEBUG1, "writing statsfile '%s'", statfile);
+
/*
* Open the statistics temp file to write out the current values.
*/
@@ -3484,6 +3529,20 @@ pgstat_write_statsfile(bool permanent)
while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
{
/*
+ * Write out the tables and functions into a separate file, but only
+ * if the database is in the requests or if it's a forced write (then
+ * all the DBs need to be written - e.g. at shutdown).
+ *
+ * We need to do this before the dbentry write to write the proper
+ * timestamp to the global file.
+ */
+ if (force || pgstat_db_requested(dbentry->databaseid)) {
+ elog(DEBUG1, "writing statsfile for DB %d", dbentry->databaseid);
+ dbentry->stats_timestamp = globalStats.stats_timestamp;
+ pgstat_write_db_statsfile(dbentry, permanent);
+ }
+
+ /*
* Write out the DB entry including the number of live backends. We
* don't write the tables or functions pointers, since they're of no
* use to any other process.
@@ -3493,29 +3552,10 @@ pgstat_write_statsfile(bool permanent)
(void) rc; /* we'll check for error with ferror */
/*
- * Walk through the database's access stats per table.
- */
- hash_seq_init(&tstat, dbentry->tables);
- while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
- {
- fputc('T', fpout);
- rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
- (void) rc; /* we'll check for error with ferror */
- }
-
- /*
- * Walk through the database's function stats table.
- */
- hash_seq_init(&fstat, dbentry->functions);
- while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
- {
- fputc('F', fpout);
- rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
- (void) rc; /* we'll check for error with ferror */
- }
-
- /*
* Mark the end of this DB
+ *
+ * TODO Does using these chars still make sense, when the tables/func
+ * stats are moved to a separate file?
*/
fputc('d', fpout);
}
@@ -3527,6 +3567,28 @@ pgstat_write_statsfile(bool permanent)
*/
fputc('E', fpout);
+ /* In any case, we can just throw away all the db requests, but we need to
+ * write dummy files for databases without a stat entry (it would cause
+ * issues in pgstat_read_db_statsfile_timestamp and pgstat wait timeouts).
+ * This may happen e.g. for the shared DB (oid = 0) right after initdb.
+ */
+ if (last_statrequests != NULL)
+ {
+ int i = 0;
+ for (i = 0; i < num_statrequests; i++)
+ {
+ /* Create dummy files for requested databases without a proper
+ * dbentry. It's much easier this way than dealing with multiple
+ * timestamps, possibly existing but not yet written DBs etc. */
+ if (! pgstat_get_db_entry(last_statrequests[i].databaseid, false))
+ pgstat_write_db_dummyfile(last_statrequests[i].databaseid);
+ }
+
+ pfree(last_statrequests);
+ last_statrequests = NULL;
+ num_statrequests = 0;
+ }
+
if (ferror(fpout))
{
ereport(LOG,
@@ -3552,57 +3614,247 @@ pgstat_write_statsfile(bool permanent)
tmpfile, statfile)));
unlink(tmpfile);
}
- else
+
+ if (permanent)
+ unlink(pgstat_stat_filename);
+}
+
+
+/* ----------
+ * pgstat_write_db_statsfile() -
+ *
+ * Tell the news. This writes the stats file for a single database.
+ *
+ * If writing to the permanent file (happens when the collector is
+ * shutting down only), remove the temporary file so that backends
+ * starting up under a new postmaster can't read the old data before
+ * the new collector is ready.
+ * ----------
+ */
+static void
+pgstat_write_db_statsfile(PgStat_StatDBEntry * dbentry, bool permanent)
+{
+ HASH_SEQ_STATUS tstat;
+ HASH_SEQ_STATUS fstat;
+ PgStat_StatTabEntry *tabentry;
+ PgStat_StatFuncEntry *funcentry;
+ FILE *fpout;
+ int32 format_id;
+ const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_DB_TMPFILE : pgstat_stat_db_tmpname;
+ const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_DB_FILENAME : pgstat_stat_db_filename;
+ int rc;
+
+ /*
+ * OIDs are 32-bit values, so 10 chars should be safe, +1 for the \0 byte
+ */
+ char db_tmpfile[strlen(tmpfile) + 11];
+ char db_statfile[strlen(statfile) + 11];
+
+ /*
+ * Append database OID at the end of the basic filename (both for tmp and target file).
+ */
+ snprintf(db_tmpfile, strlen(tmpfile) + 11, tmpfile, dbentry->databaseid);
+ snprintf(db_statfile, strlen(statfile) + 11, statfile, dbentry->databaseid);
+
+ elog(DEBUG1, "writing statsfile '%s'", db_statfile);
+
+ /*
+ * Open the statistics temp file to write out the current values.
+ */
+ fpout = AllocateFile(db_tmpfile, PG_BINARY_W);
+ if (fpout == NULL)
{
- /*
- * Successful write, so update last_statwrite.
- */
- last_statwrite = globalStats.stats_timestamp;
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not open temporary statistics file \"%s\": %m",
+ db_tmpfile)));
+ return;
+ }
- /*
- * If there is clock skew between backends and the collector, we could
- * receive a stats request time that's in the future. If so, complain
- * and reset last_statrequest. Resetting ensures that no inquiry
- * message can cause more than one stats file write to occur.
- */
- if (last_statrequest > last_statwrite)
- {
- char *reqtime;
- char *mytime;
-
- /* Copy because timestamptz_to_str returns a static buffer */
- reqtime = pstrdup(timestamptz_to_str(last_statrequest));
- mytime = pstrdup(timestamptz_to_str(last_statwrite));
- elog(LOG, "last_statrequest %s is later than collector's time %s",
- reqtime, mytime);
- pfree(reqtime);
- pfree(mytime);
-
- last_statrequest = last_statwrite;
- }
+ /*
+ * Write the file header --- currently just a format ID.
+ */
+ format_id = PGSTAT_FILE_FORMAT_ID;
+ rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
+ (void) rc; /* we'll check for error with ferror */
+
+ /*
+ * Write the timestamp.
+ */
+ rc = fwrite(&(globalStats.stats_timestamp), sizeof(globalStats.stats_timestamp), 1, fpout);
+ (void) rc; /* we'll check for error with ferror */
+
+ /*
+ * Walk through the database's access stats per table.
+ */
+ hash_seq_init(&tstat, dbentry->tables);
+ while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
+ {
+ fputc('T', fpout);
+ rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
+ (void) rc; /* we'll check for error with ferror */
}
+ /*
+ * Walk through the database's function stats table.
+ */
+ hash_seq_init(&fstat, dbentry->functions);
+ while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
+ {
+ fputc('F', fpout);
+ rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
+ (void) rc; /* we'll check for error with ferror */
+ }
+
+ /*
+ * No more output to be done. Close the temp file and replace the old
+ * pgstat.stat with it. The ferror() check replaces testing for error
+ * after each individual fputc or fwrite above.
+ */
+ fputc('E', fpout);
+
+ if (ferror(fpout))
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not write temporary statistics file \"%s\": %m",
+ db_tmpfile)));
+ FreeFile(fpout);
+ unlink(db_tmpfile);
+ }
+ else if (FreeFile(fpout) < 0)
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not close temporary statistics file \"%s\": %m",
+ db_tmpfile)));
+ unlink(db_tmpfile);
+ }
+ else if (rename(db_tmpfile, db_statfile) < 0)
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
+ db_tmpfile, db_statfile)));
+ unlink(db_tmpfile);
+ }
+
if (permanent)
- unlink(pgstat_stat_filename);
+ {
+ char db_statfile[strlen(pgstat_stat_db_filename) + 11];
+ snprintf(db_statfile, strlen(pgstat_stat_db_filename) + 11,
+ pgstat_stat_db_filename, dbentry->databaseid);
+ elog(DEBUG1, "removing temporary stat file '%s'", db_statfile);
+ unlink(db_statfile);
+ }
}
/* ----------
+ * pgstat_write_db_dummyfile() -
+ *
+ * All this does is write a dummy stats file for databases without a dbentry
+ * yet. It writes just a file header - format ID and a timestamp.
+ * ----------
+ */
+static void
+pgstat_write_db_dummyfile(Oid databaseid)
+{
+ FILE *fpout;
+ int32 format_id;
+ int rc;
+
+ /*
+ * OIDs are 32-bit values, so 10 chars should be safe, +1 for the \0 byte
+ */
+ char db_tmpfile[strlen(pgstat_stat_db_tmpname) + 11];
+ char db_statfile[strlen(pgstat_stat_db_filename) + 11];
+
+ /*
+ * Append database OID at the end of the basic filename (both for tmp and target file).
+ */
+ snprintf(db_tmpfile, strlen(pgstat_stat_db_tmpname) + 11, pgstat_stat_db_tmpname, databaseid);
+ snprintf(db_statfile, strlen(pgstat_stat_db_filename) + 11, pgstat_stat_db_filename, databaseid);
+
+ elog(DEBUG1, "writing statsfile '%s'", db_statfile);
+
+ /*
+ * Open the statistics temp file to write out the current values.
+ */
+ fpout = AllocateFile(db_tmpfile, PG_BINARY_W);
+ if (fpout == NULL)
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not open temporary statistics file \"%s\": %m",
+ db_tmpfile)));
+ return;
+ }
+
+ /*
+ * Write the file header --- currently just a format ID.
+ */
+ format_id = PGSTAT_FILE_FORMAT_ID;
+ rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
+ (void) rc; /* we'll check for error with ferror */
+
+ /*
+ * Write the timestamp.
+ */
+ rc = fwrite(&(globalStats.stats_timestamp), sizeof(globalStats.stats_timestamp), 1, fpout);
+ (void) rc; /* we'll check for error with ferror */
+
+ /*
+ * No more output to be done. Close the temp file and replace the old
+ * pgstat.stat with it. The ferror() check replaces testing for error
+ * after each individual fputc or fwrite above.
+ */
+ fputc('E', fpout);
+
+ if (ferror(fpout))
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not write temporary dummy statistics file \"%s\": %m",
+ db_tmpfile)));
+ FreeFile(fpout);
+ unlink(db_tmpfile);
+ }
+ else if (FreeFile(fpout) < 0)
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not close temporary dummy statistics file \"%s\": %m",
+ db_tmpfile)));
+ unlink(db_tmpfile);
+ }
+ else if (rename(db_tmpfile, db_statfile) < 0)
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not rename temporary dummy statistics file \"%s\" to \"%s\": %m",
+ db_tmpfile, db_statfile)));
+ unlink(db_tmpfile);
+ }
+
+}
+
+/* ----------
* pgstat_read_statsfile() -
*
* Reads in an existing statistics collector file and initializes the
* databases' hash table (whose entries point to the tables' hash tables).
+ *
+ * Allows reading only the database-level stats, which is just enough for
+ * many purposes (e.g. the autovacuum launcher). If this is sufficient for
+ * you, use onlydbs=true.
* ----------
*/
static HTAB *
-pgstat_read_statsfile(Oid onlydb, bool permanent)
+pgstat_read_statsfile(Oid onlydb, bool permanent, bool onlydbs)
{
PgStat_StatDBEntry *dbentry;
PgStat_StatDBEntry dbbuf;
- PgStat_StatTabEntry *tabentry;
- PgStat_StatTabEntry tabbuf;
- PgStat_StatFuncEntry funcbuf;
- PgStat_StatFuncEntry *funcentry;
HASHCTL hash_ctl;
HTAB *dbhash;
HTAB *tabhash = NULL;
@@ -3613,6 +3865,11 @@ pgstat_read_statsfile(Oid onlydb, bool permanent)
const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
/*
+ * If we want a db-level stats only, we don't want a particular db.
+ */
+ Assert(!((onlydb != InvalidOid) && onlydbs));
+
+ /*
* The tables will live in pgStatLocalContext.
*/
pgstat_setup_memcxt();
@@ -3758,6 +4015,16 @@ pgstat_read_statsfile(Oid onlydb, bool permanent)
*/
tabhash = dbentry->tables;
funchash = dbentry->functions;
+
+ /*
+ * Read the data from the file for this database. If there was
+ * onlydb specified (!= InvalidOid), we would not get here because
+ * of a break above. So we don't need to recheck.
+ */
+ if (! onlydbs)
+ pgstat_read_db_statsfile(dbentry->databaseid, tabhash, funchash,
+ permanent);
+
break;
/*
@@ -3768,6 +4035,105 @@ pgstat_read_statsfile(Oid onlydb, bool permanent)
funchash = NULL;
break;
+ case 'E':
+ goto done;
+
+ default:
+ ereport(pgStatRunningInCollector ? LOG : WARNING,
+ (errmsg("corrupted statistics file \"%s\"",
+ statfile)));
+ goto done;
+ }
+ }
+
+done:
+ FreeFile(fpin);
+
+ if (permanent)
+ unlink(PGSTAT_STAT_PERMANENT_FILENAME);
+
+ return dbhash;
+}
+
+
+/* ----------
+ * pgstat_read_db_statsfile() -
+ *
+ * Reads in an existing statistics collector db file and initializes the
+ * tables and functions hash tables (for the database identified by Oid).
+ * ----------
+ */
+static void
+pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent)
+{
+ PgStat_StatTabEntry *tabentry;
+ PgStat_StatTabEntry tabbuf;
+ PgStat_StatFuncEntry funcbuf;
+ PgStat_StatFuncEntry *funcentry;
+ FILE *fpin;
+ int32 format_id;
+ TimestampTz timestamp;
+ bool found;
+ const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_DB_FILENAME : pgstat_stat_db_filename;
+
+ /*
+ * OIDs are 32-bit values, so 10 chars should be safe, +1 for the \0 byte
+ */
+ char db_statfile[strlen(statfile) + 11];
+
+ /*
+ * Format the per-database statfile name by appending the database OID.
+ */
+ snprintf(db_statfile, strlen(statfile) + 11, statfile, databaseid);
+
+ /*
+ * Try to open the status file. If it doesn't exist, the backends simply
+ * return zero for anything and the collector simply starts from scratch
+ * with empty counters.
+ *
+ * ENOENT is a possibility if the stats collector is not running or has
+ * not yet written the stats file the first time. Any other failure
+ * condition is suspicious.
+ */
+ if ((fpin = AllocateFile(db_statfile, PG_BINARY_R)) == NULL)
+ {
+ if (errno != ENOENT)
+ ereport(pgStatRunningInCollector ? LOG : WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not open statistics file \"%s\": %m",
+ db_statfile)));
+ return;
+ }
+
+ /*
+ * Verify it's of the expected format.
+ */
+ if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id)
+ || format_id != PGSTAT_FILE_FORMAT_ID)
+ {
+ ereport(pgStatRunningInCollector ? LOG : WARNING,
+ (errmsg("corrupted statistics file \"%s\"", db_statfile)));
+ goto done;
+ }
+
+ /*
+ * Read the file timestamp.
+ */
+ if (fread(&timestamp, 1, sizeof(timestamp), fpin) != sizeof(timestamp))
+ {
+ ereport(pgStatRunningInCollector ? LOG : WARNING,
+ (errmsg("corrupted statistics file \"%s\"", db_statfile)));
+ goto done;
+ }
+
+ /*
+ * We found an existing collector stats file. Read it and put all the
+ * hashtable entries into place.
+ */
+ for (;;)
+ {
+ switch (fgetc(fpin))
+ {
/*
* 'T' A PgStat_StatTabEntry follows.
*/
@@ -3777,7 +4143,7 @@ pgstat_read_statsfile(Oid onlydb, bool permanent)
{
ereport(pgStatRunningInCollector ? LOG : WARNING,
(errmsg("corrupted statistics file \"%s\"",
- statfile)));
+ db_statfile)));
goto done;
}
@@ -3795,7 +4161,7 @@ pgstat_read_statsfile(Oid onlydb, bool permanent)
{
ereport(pgStatRunningInCollector ? LOG : WARNING,
(errmsg("corrupted statistics file \"%s\"",
- statfile)));
+ db_statfile)));
goto done;
}
@@ -3811,7 +4177,7 @@ pgstat_read_statsfile(Oid onlydb, bool permanent)
{
ereport(pgStatRunningInCollector ? LOG : WARNING,
(errmsg("corrupted statistics file \"%s\"",
- statfile)));
+ db_statfile)));
goto done;
}
@@ -3829,7 +4195,7 @@ pgstat_read_statsfile(Oid onlydb, bool permanent)
{
ereport(pgStatRunningInCollector ? LOG : WARNING,
(errmsg("corrupted statistics file \"%s\"",
- statfile)));
+ db_statfile)));
goto done;
}
@@ -3845,7 +4211,7 @@ pgstat_read_statsfile(Oid onlydb, bool permanent)
default:
ereport(pgStatRunningInCollector ? LOG : WARNING,
(errmsg("corrupted statistics file \"%s\"",
- statfile)));
+ db_statfile)));
goto done;
}
}
@@ -3854,37 +4220,47 @@ done:
FreeFile(fpin);
if (permanent)
- unlink(PGSTAT_STAT_PERMANENT_FILENAME);
+ {
+ char db_statfile[strlen(PGSTAT_STAT_PERMANENT_DB_FILENAME) + 11];
+ snprintf(db_statfile, strlen(PGSTAT_STAT_PERMANENT_DB_FILENAME) + 11,
+ PGSTAT_STAT_PERMANENT_DB_FILENAME, databaseid);
+ elog(DEBUG1, "removing permanent stats file '%s'", db_statfile);
+ unlink(db_statfile);
+ }
- return dbhash;
+ return;
}
/* ----------
- * pgstat_read_statsfile_timestamp() -
+ * pgstat_read_db_statsfile_timestamp() -
*
- * Attempt to fetch the timestamp of an existing stats file.
+ * Attempt to fetch the timestamp of an existing stats file (for a DB).
* Returns TRUE if successful (timestamp is stored at *ts).
* ----------
*/
static bool
-pgstat_read_statsfile_timestamp(bool permanent, TimestampTz *ts)
+pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent, TimestampTz *ts)
{
- PgStat_GlobalStats myGlobalStats;
+ TimestampTz timestamp;
FILE *fpin;
int32 format_id;
- const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+ const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_DB_FILENAME : pgstat_stat_db_filename;
+ char db_statfile[strlen(statfile) + 11];
+
+ /* format the db statfile filename */
+ snprintf(db_statfile, strlen(statfile) + 11, statfile, databaseid);
/*
* Try to open the status file. As above, anything but ENOENT is worthy
* of complaining about.
*/
- if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
+ if ((fpin = AllocateFile(db_statfile, PG_BINARY_R)) == NULL)
{
if (errno != ENOENT)
ereport(pgStatRunningInCollector ? LOG : WARNING,
(errcode_for_file_access(),
errmsg("could not open statistics file \"%s\": %m",
- statfile)));
+ db_statfile)));
return false;
}
@@ -3895,7 +4271,7 @@ pgstat_read_statsfile_timestamp(bool permanent, TimestampTz *ts)
|| format_id != PGSTAT_FILE_FORMAT_ID)
{
ereport(pgStatRunningInCollector ? LOG : WARNING,
- (errmsg("corrupted statistics file \"%s\"", statfile)));
+ (errmsg("corrupted statistics file \"%s\"", db_statfile)));
FreeFile(fpin);
return false;
}
@@ -3903,15 +4279,15 @@ pgstat_read_statsfile_timestamp(bool permanent, TimestampTz *ts)
/*
* Read global stats struct
*/
- if (fread(&myGlobalStats, 1, sizeof(myGlobalStats), fpin) != sizeof(myGlobalStats))
+ if (fread(&timestamp, 1, sizeof(TimestampTz), fpin) != sizeof(TimestampTz))
{
ereport(pgStatRunningInCollector ? LOG : WARNING,
- (errmsg("corrupted statistics file \"%s\"", statfile)));
+ (errmsg("corrupted statistics file \"%s\"", db_statfile)));
FreeFile(fpin);
return false;
}
- *ts = myGlobalStats.stats_timestamp;
+ *ts = timestamp;
FreeFile(fpin);
return true;
@@ -3947,7 +4323,7 @@ backend_read_statsfile(void)
CHECK_FOR_INTERRUPTS();
- ok = pgstat_read_statsfile_timestamp(false, &file_ts);
+ ok = pgstat_read_db_statsfile_timestamp(MyDatabaseId, false, &file_ts);
cur_ts = GetCurrentTimestamp();
/* Calculate min acceptable timestamp, if we didn't already */
@@ -4006,7 +4382,7 @@ backend_read_statsfile(void)
pfree(mytime);
}
- pgstat_send_inquiry(cur_ts, min_ts);
+ pgstat_send_inquiry(cur_ts, min_ts, MyDatabaseId);
break;
}
@@ -4016,7 +4392,7 @@ backend_read_statsfile(void)
/* Not there or too old, so kick the collector and wait a bit */
if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
- pgstat_send_inquiry(cur_ts, min_ts);
+ pgstat_send_inquiry(cur_ts, min_ts, MyDatabaseId);
pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
}
@@ -4026,9 +4402,16 @@ backend_read_statsfile(void)
/* Autovacuum launcher wants stats about all databases */
if (IsAutoVacuumLauncherProcess())
- pgStatDBHash = pgstat_read_statsfile(InvalidOid, false);
+ /*
+ * FIXME Does it really need info including tables/functions? Or is it enough to read
+ * database-level stats? It seems to me the launcher needs PgStat_StatDBEntry only
+ * (at least that's how I understand the rebuild_database_list() in autovacuum.c),
+ * because pgstat_stattabentries are used in do_autovacuum() only, and that's what's
+ * executed in workers ... So maybe we'd be just fine by reading in the dbentries?
+ */
+ pgStatDBHash = pgstat_read_statsfile(InvalidOid, false, true);
else
- pgStatDBHash = pgstat_read_statsfile(MyDatabaseId, false);
+ pgStatDBHash = pgstat_read_statsfile(MyDatabaseId, false, false);
}
@@ -4084,44 +4467,84 @@ pgstat_clear_snapshot(void)
static void
pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
{
- /*
- * Advance last_statrequest if this requestor has a newer cutoff time
- * than any previous request.
- */
- if (msg->cutoff_time > last_statrequest)
- last_statrequest = msg->cutoff_time;
+ int i = 0;
+ bool found = false;
+ PgStat_StatDBEntry *dbentry;
+
+ elog(DEBUG1, "received inquiry for %d", msg->databaseid);
/*
- * If the requestor's local clock time is older than last_statwrite, we
- * should suspect a clock glitch, ie system time going backwards; though
- * the more likely explanation is just delayed message receipt. It is
- * worth expending a GetCurrentTimestamp call to be sure, since a large
- * retreat in the system clock reading could otherwise cause us to neglect
- * to update the stats file for a long time.
+ * Find the last write request for this DB (found=true in that case). Plain
+ * linear search, not really worth doing any magic here (probably).
*/
- if (msg->clock_time < last_statwrite)
+ for (i = 0; i < num_statrequests; i++)
+ {
+ if (last_statrequests[i].databaseid == msg->databaseid)
+ {
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ /*
+ * There already is a request for this DB, so let's advance the
+ * request time if this requestor has a newer cutoff time
+ * than any previous request.
+ */
+ if (msg->cutoff_time > last_statrequests[i].request_time)
+ last_statrequests[i].request_time = msg->cutoff_time;
+ }
+ else
{
- TimestampTz cur_ts = GetCurrentTimestamp();
+ /*
+ * There's no request for this DB yet, so let's create it (allocate
+ * space for it, set the values).
+ */
+ if (last_statrequests == NULL)
+ last_statrequests = palloc(sizeof(DBWriteRequest));
+ else
+ last_statrequests = repalloc(last_statrequests,
+ (num_statrequests + 1)*sizeof(DBWriteRequest));
+
+ last_statrequests[num_statrequests].databaseid = msg->databaseid;
+ last_statrequests[num_statrequests].request_time = msg->clock_time;
+ num_statrequests += 1;
- if (cur_ts < last_statwrite)
+ /*
+ * If the requestor's local clock time is older than last_statwrite, we
+ * should suspect a clock glitch, ie system time going backwards; though
+ * the more likely explanation is just delayed message receipt. It is
+ * worth expending a GetCurrentTimestamp call to be sure, since a large
+ * retreat in the system clock reading could otherwise cause us to neglect
+ * to update the stats file for a long time.
+ */
+ dbentry = pgstat_get_db_entry(msg->databaseid, false);
+ if ((dbentry != NULL) && (msg->clock_time < dbentry->stats_timestamp))
{
- /*
- * Sure enough, time went backwards. Force a new stats file write
- * to get back in sync; but first, log a complaint.
- */
- char *writetime;
- char *mytime;
-
- /* Copy because timestamptz_to_str returns a static buffer */
- writetime = pstrdup(timestamptz_to_str(last_statwrite));
- mytime = pstrdup(timestamptz_to_str(cur_ts));
- elog(LOG, "last_statwrite %s is later than collector's time %s",
- writetime, mytime);
- pfree(writetime);
- pfree(mytime);
-
- last_statrequest = cur_ts;
- last_statwrite = last_statrequest - 1;
+ TimestampTz cur_ts = GetCurrentTimestamp();
+
+ if (cur_ts < dbentry->stats_timestamp)
+ {
+ /*
+ * Sure enough, time went backwards. Force a new stats file write
+ * to get back in sync; but first, log a complaint.
+ */
+ char *writetime;
+ char *mytime;
+
+ /* Copy because timestamptz_to_str returns a static buffer */
+ writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
+ mytime = pstrdup(timestamptz_to_str(cur_ts));
+ elog(LOG, "last_statwrite %s is later than collector's time %s for "
+ "db %d", writetime, mytime, dbentry->databaseid);
+ pfree(writetime);
+ pfree(mytime);
+
+ last_statrequests[num_statrequests - 1].request_time = cur_ts;
+ dbentry->stats_timestamp = cur_ts - 1;
+ }
}
}
}
@@ -4278,10 +4701,17 @@ pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
/*
- * If found, remove it.
+ * If found, remove it (along with the db statfile).
*/
if (dbentry)
{
+ char db_statfile[strlen(pgstat_stat_db_filename) + 11];
+ snprintf(db_statfile, strlen(pgstat_stat_db_filename) + 11,
+ pgstat_stat_db_filename, dbentry->databaseid);
+
+ elog(DEBUG1, "removing %s", db_statfile);
+ unlink(db_statfile);
+
if (dbentry->tables != NULL)
hash_destroy(dbentry->tables);
if (dbentry->functions != NULL)
@@ -4687,3 +5117,58 @@ pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
HASH_REMOVE, NULL);
}
}
+
+/* ----------
+ * pgstat_write_statsfile_needed() -
+ *
+ * Checks whether there's a db stats request, requiring a file write.
+ *
+ * TODO Seems that, thanks to the way we handle last_statrequests (erase after
+ * a write), this is unnecessary. Just check that there's at least one
+ * request and you're done. Although there might be delayed requests ...
+ * ----------
+ */
+
+static bool pgstat_write_statsfile_needed(void)
+{
+ int i = 0;
+ PgStat_StatDBEntry *dbentry;
+
+ /* Check whether any requested database needs its stats refreshed. */
+ for (i = 0; i < num_statrequests; i++)
+ {
+ dbentry = pgstat_get_db_entry(last_statrequests[i].databaseid, false);
+
+ /* No dbentry yet or too old. */
+ if ((! dbentry) ||
+ (dbentry->stats_timestamp < last_statrequests[i].request_time)) {
+ return true;
+ }
+
+ }
+
+ /* Well, everything was written recently ... */
+ return false;
+}
+
+/* ----------
+ * pgstat_db_requested() -
+ *
+ * Checks whether stats for a particular DB need to be written to a file.
+ * ----------
+ */
+
+static bool
+pgstat_db_requested(Oid databaseid)
+{
+ int i = 0;
+
+ /* Check whether this database is in the requests list. */
+ for (i = 0; i < num_statrequests; i++)
+ {
+ if (last_statrequests[i].databaseid == databaseid)
+ return true;
+ }
+
+ return false;
+}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 2cf34ce..e3e432b 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -8730,20 +8730,43 @@ static void
assign_pgstat_temp_directory(const char *newval, void *extra)
{
/* check_canonical_path already canonicalized newval for us */
+ char *dname;
char *tname;
char *fname;
-
- tname = guc_malloc(ERROR, strlen(newval) + 12); /* /pgstat.tmp */
- sprintf(tname, "%s/pgstat.tmp", newval);
- fname = guc_malloc(ERROR, strlen(newval) + 13); /* /pgstat.stat */
- sprintf(fname, "%s/pgstat.stat", newval);
-
+ char *tname_db;
+ char *fname_db;
+
+ /* directory */
+ dname = guc_malloc(ERROR, strlen(newval) + 1); /* runtime dir */
+ sprintf(dname, "%s", newval);
+
+ /* global stats */
+ tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
+ sprintf(tname, "%s/global.tmp", newval);
+ fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
+ sprintf(fname, "%s/global.stat", newval);
+
+ /* per-db stats */
+ tname_db = guc_malloc(ERROR, strlen(newval) + 8); /* /%d.tmp */
+ sprintf(tname_db, "%s/%%d.tmp", newval);
+ fname_db = guc_malloc(ERROR, strlen(newval) + 9); /* /%d.stat */
+ sprintf(fname_db, "%s/%%d.stat", newval);
+
+ if (pgstat_stat_directory)
+ free(pgstat_stat_directory);
+ pgstat_stat_directory = dname;
if (pgstat_stat_tmpname)
free(pgstat_stat_tmpname);
pgstat_stat_tmpname = tname;
if (pgstat_stat_filename)
free(pgstat_stat_filename);
pgstat_stat_filename = fname;
+ if (pgstat_stat_db_tmpname)
+ free(pgstat_stat_db_tmpname);
+ pgstat_stat_db_tmpname = tname_db;
+ if (pgstat_stat_db_filename)
+ free(pgstat_stat_db_filename);
+ pgstat_stat_db_filename = fname_db;
}
static bool
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 3e05ac3..8c86301 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -179,6 +179,7 @@ char *restrict_env;
#endif
const char *subdirs[] = {
"global",
+ "global/stat",
"pg_xlog",
"pg_xlog/archive_status",
"pg_clog",
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 613c1c2..b3467d2 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -205,6 +205,7 @@ typedef struct PgStat_MsgInquiry
PgStat_MsgHdr m_hdr;
TimestampTz clock_time; /* observed local clock time */
TimestampTz cutoff_time; /* minimum acceptable file timestamp */
+ Oid databaseid; /* requested DB (InvalidOid => all DBs) */
} PgStat_MsgInquiry;
@@ -514,7 +515,7 @@ typedef union PgStat_Msg
* ------------------------------------------------------------
*/
-#define PGSTAT_FILE_FORMAT_ID 0x01A5BC9A
+#define PGSTAT_FILE_FORMAT_ID 0xA240CA47
/* ----------
* PgStat_StatDBEntry The collector's data per database
@@ -545,6 +546,7 @@ typedef struct PgStat_StatDBEntry
PgStat_Counter n_block_write_time;
TimestampTz stat_reset_timestamp;
+ TimestampTz stats_timestamp; /* time of db stats file update */
/*
* tables and functions must be last in the struct, because we don't write
@@ -722,8 +724,11 @@ extern bool pgstat_track_activities;
extern bool pgstat_track_counts;
extern int pgstat_track_functions;
extern PGDLLIMPORT int pgstat_track_activity_query_size;
+extern char *pgstat_stat_directory;
extern char *pgstat_stat_tmpname;
extern char *pgstat_stat_filename;
+extern char *pgstat_stat_db_tmpname;
+extern char *pgstat_stat_db_filename;
/*
* BgWriter statistics counters are updated directly by bgwriter and bufmgr
On 03.01.2013 01:15, Tomas Vondra wrote:
2) a new global/stat directory
------------------------------
The pgstat.stat file was originally saved into the "global" directory,
but with so many files that would get rather messy so I've created a new
global/stat directory and all the files are stored there.
This also means we can do a simple "delete files in the dir" when
pgstat_reset_all is called.
How about creating the new directory as a direct subdir of $PGDATA,
rather than buried in global? "global" is supposed to contain data
related to shared catalog relations (plus pg_control), so it doesn't
seem like the right location for per-database stat files. Also, if we're
going to have admins manually zapping the directory (hopefully when the
system is offline), that's less scary if the directory is not buried as
deep.
- Heikki
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 3.1.2013 18:47, Heikki Linnakangas wrote:
On 03.01.2013 01:15, Tomas Vondra wrote:
2) a new global/stat directory
------------------------------
The pgstat.stat file was originally saved into the "global" directory,
but with so many files that would get rather messy so I've created a new
global/stat directory and all the files are stored there.
This also means we can do a simple "delete files in the dir" when
pgstat_reset_all is called.
How about creating the new directory as a direct subdir of $PGDATA,
rather than buried in global? "global" is supposed to contain data
related to shared catalog relations (plus pg_control), so it doesn't
seem like the right location for per-database stat files. Also, if we're
going to have admins manually zapping the directory (hopefully when the
system is offline), that's less scary if the directory is not buried as
deep.
That's clearly possible and it's a trivial change. I was thinking about
that actually, but then I placed the directory into "global" because
that's where the "pgstat.stat" originally was.
Tomas
On Thu, Jan 3, 2013 at 8:31 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
On 3.1.2013 18:47, Heikki Linnakangas wrote:
On 03.01.2013 01:15, Tomas Vondra wrote:
2) a new global/stat directory
------------------------------
The pgstat.stat file was originally saved into the "global" directory,
but with so many files that would get rather messy so I've created a new
global/stat directory and all the files are stored there.
This also means we can do a simple "delete files in the dir" when
pgstat_reset_all is called.
How about creating the new directory as a direct subdir of $PGDATA,
rather than buried in global? "global" is supposed to contain data
related to shared catalog relations (plus pg_control), so it doesn't
seem like the right location for per-database stat files. Also, if we're
going to have admins manually zapping the directory (hopefully when the
system is offline), that's less scary if the directory is not buried as
deep.
That's clearly possible and it's a trivial change. I was thinking about
that actually, but then I placed the directory into "global" because
that's where the "pgstat.stat" originally was.
Yeah, +1 for a separate directory not in global.
--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
On 3.1.2013 20:33, Magnus Hagander wrote:
On Thu, Jan 3, 2013 at 8:31 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
On 3.1.2013 18:47, Heikki Linnakangas wrote:
How about creating the new directory as a direct subdir of $PGDATA,
rather than buried in global? "global" is supposed to contain data
related to shared catalog relations (plus pg_control), so it doesn't
seem like the right location for per-database stat files. Also, if we're
going to have admins manually zapping the directory (hopefully when the
system is offline), that's less scary if the directory is not buried as
deep.
That's clearly possible and it's a trivial change. I was thinking about
that actually, but then I placed the directory into "global" because
that's where the "pgstat.stat" originally was.
Yeah, +1 for a separate directory not in global.
OK, I moved the files from "global/stat" to "stat".
Tomas
Attachments:
stats-split-v5.patch (text/plain)
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index be3adf1..4ec485e 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -64,10 +64,14 @@
/* ----------
* Paths for the statistics files (relative to installation's $PGDATA).
+ * Permanent and temporary, global and per-database files.
* ----------
*/
-#define PGSTAT_STAT_PERMANENT_FILENAME "global/pgstat.stat"
-#define PGSTAT_STAT_PERMANENT_TMPFILE "global/pgstat.tmp"
+#define PGSTAT_STAT_PERMANENT_DIRECTORY "stat"
+#define PGSTAT_STAT_PERMANENT_FILENAME "stat/global.stat"
+#define PGSTAT_STAT_PERMANENT_TMPFILE "stat/global.tmp"
+#define PGSTAT_STAT_PERMANENT_DB_FILENAME "stat/%d.stat"
+#define PGSTAT_STAT_PERMANENT_DB_TMPFILE "stat/%d.tmp"
/* ----------
* Timer definitions.
@@ -115,8 +119,11 @@ int pgstat_track_activity_query_size = 1024;
* Built from GUC parameter
* ----------
*/
+char *pgstat_stat_directory = NULL;
char *pgstat_stat_filename = NULL;
char *pgstat_stat_tmpname = NULL;
+char *pgstat_stat_db_filename = NULL;
+char *pgstat_stat_db_tmpname = NULL;
/*
* BgWriter global statistics counters (unused in other processes).
@@ -219,11 +226,16 @@ static int localNumBackends = 0;
*/
static PgStat_GlobalStats globalStats;
-/* Last time the collector successfully wrote the stats file */
-static TimestampTz last_statwrite;
+/* Write request info for each database */
+typedef struct DBWriteRequest
+{
+ Oid databaseid; /* OID of the database to write */
+ TimestampTz request_time; /* timestamp of the last write request */
+} DBWriteRequest;
-/* Latest statistics request time from backends */
-static TimestampTz last_statrequest;
+/* Latest statistics request time from backends for each DB */
+static DBWriteRequest * last_statrequests = NULL;
+static int num_statrequests = 0;
static volatile bool need_exit = false;
static volatile bool got_SIGHUP = false;
@@ -252,11 +264,17 @@ static void pgstat_sighup_handler(SIGNAL_ARGS);
static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
Oid tableoid, bool create);
-static void pgstat_write_statsfile(bool permanent);
-static HTAB *pgstat_read_statsfile(Oid onlydb, bool permanent);
+static void pgstat_write_statsfile(bool permanent, bool force);
+static void pgstat_write_db_statsfile(PgStat_StatDBEntry * dbentry, bool permanent);
+static void pgstat_write_db_dummyfile(Oid databaseid);
+static HTAB *pgstat_read_statsfile(Oid onlydb, bool permanent, bool onlydbs);
+static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
static void backend_read_statsfile(void);
static void pgstat_read_current_status(void);
+static bool pgstat_write_statsfile_needed(void);
+static bool pgstat_db_requested(Oid databaseid);
+
static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
static void pgstat_send_funcstats(void);
static HTAB *pgstat_collect_oids(Oid catalogid);
@@ -285,7 +303,6 @@ static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int le
static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
-
/* ------------------------------------------------------------
* Public functions called from postmaster follow
* ------------------------------------------------------------
@@ -549,8 +566,34 @@ startup_failed:
void
pgstat_reset_all(void)
{
- unlink(pgstat_stat_filename);
- unlink(PGSTAT_STAT_PERMANENT_FILENAME);
+ DIR * dir;
+ struct dirent * entry;
+
+ dir = AllocateDir(pgstat_stat_directory);
+ while ((entry = ReadDir(dir, pgstat_stat_directory)) != NULL)
+ {
+ char fname[strlen(pgstat_stat_directory) + strlen(entry->d_name) + 2];
+
+ if (strcmp(entry->d_name, ".") == 0 || strcmp(entry->d_name, "..") == 0)
+ continue;
+
+ sprintf(fname, "%s/%s", pgstat_stat_directory, entry->d_name);
+ unlink(fname);
+ }
+ FreeDir(dir);
+
+ dir = AllocateDir(PGSTAT_STAT_PERMANENT_DIRECTORY);
+ while ((entry = ReadDir(dir, PGSTAT_STAT_PERMANENT_DIRECTORY)) != NULL)
+ {
+ char fname[strlen(PGSTAT_STAT_PERMANENT_DIRECTORY) + strlen(entry->d_name) + 2];
+
+ if (strcmp(entry->d_name, ".") == 0 || strcmp(entry->d_name, "..") == 0)
+ continue;
+
+ sprintf(fname, "%s/%s", PGSTAT_STAT_PERMANENT_DIRECTORY, entry->d_name);
+ unlink(fname);
+ }
+ FreeDir(dir);
}
#ifdef EXEC_BACKEND
@@ -1408,13 +1451,14 @@ pgstat_ping(void)
* ----------
*/
static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time)
+pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
{
PgStat_MsgInquiry msg;
pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
msg.clock_time = clock_time;
msg.cutoff_time = cutoff_time;
+ msg.databaseid = databaseid;
pgstat_send(&msg, sizeof(msg));
}
@@ -3004,6 +3048,7 @@ PgstatCollectorMain(int argc, char *argv[])
int len;
PgStat_Msg msg;
int wr;
+ bool first_write = true;
IsUnderPostmaster = true; /* we are a postmaster subprocess now */
@@ -3053,17 +3098,11 @@ PgstatCollectorMain(int argc, char *argv[])
init_ps_display("stats collector process", "", "", "");
/*
- * Arrange to write the initial status file right away
- */
- last_statrequest = GetCurrentTimestamp();
- last_statwrite = last_statrequest - 1;
-
- /*
* Read in an existing statistics stats file or initialize the stats to
- * zero.
+ * zero (read data for all databases, including table/func stats).
*/
pgStatRunningInCollector = true;
- pgStatDBHash = pgstat_read_statsfile(InvalidOid, true);
+ pgStatDBHash = pgstat_read_statsfile(InvalidOid, true, false);
/*
* Loop to process messages until we get SIGQUIT or detect ungraceful
@@ -3107,10 +3146,14 @@ PgstatCollectorMain(int argc, char *argv[])
/*
* Write the stats file if a new request has arrived that is not
- * satisfied by existing file.
+ * satisfied by existing file (force writing all files if it's
+ * the first write after startup).
*/
- if (last_statwrite < last_statrequest)
- pgstat_write_statsfile(false);
+ if (first_write || pgstat_write_statsfile_needed())
+ {
+ pgstat_write_statsfile(false, first_write);
+ first_write = false;
+ }
/*
* Try to receive and process a message. This will not block,
@@ -3269,7 +3312,7 @@ PgstatCollectorMain(int argc, char *argv[])
/*
* Save the final stats to reuse at next startup.
*/
- pgstat_write_statsfile(true);
+ pgstat_write_statsfile(true, true);
exit(0);
}
@@ -3429,23 +3472,25 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
* shutting down only), remove the temporary file so that backends
* starting up under a new postmaster can't read the old data before
* the new collector is ready.
+ *
+ * When the 'force' is false, only the requested databases (listed in
+ * last_statrequests) will be written. If 'force' is true, all databases
+ * will be written (this is used e.g. at shutdown).
* ----------
*/
static void
-pgstat_write_statsfile(bool permanent)
+pgstat_write_statsfile(bool permanent, bool force)
{
HASH_SEQ_STATUS hstat;
- HASH_SEQ_STATUS tstat;
- HASH_SEQ_STATUS fstat;
PgStat_StatDBEntry *dbentry;
- PgStat_StatTabEntry *tabentry;
- PgStat_StatFuncEntry *funcentry;
FILE *fpout;
int32 format_id;
const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
int rc;
+ elog(DEBUG1, "writing statsfile '%s'", statfile);
+
/*
* Open the statistics temp file to write out the current values.
*/
@@ -3484,6 +3529,20 @@ pgstat_write_statsfile(bool permanent)
while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
{
/*
+ * Write out the tables and functions into a separate file, but only
+ * if the database is in the requests or if it's a forced write (then
+ * all the DBs need to be written - e.g. at shutdown).
+ *
+ * We need to do this before the dbentry write to write the proper
+ * timestamp to the global file.
+ */
+ if (force || pgstat_db_requested(dbentry->databaseid)) {
+ elog(DEBUG1, "writing statsfile for DB %d", dbentry->databaseid);
+ dbentry->stats_timestamp = globalStats.stats_timestamp;
+ pgstat_write_db_statsfile(dbentry, permanent);
+ }
+
+ /*
* Write out the DB entry including the number of live backends. We
* don't write the tables or functions pointers, since they're of no
* use to any other process.
@@ -3493,29 +3552,10 @@ pgstat_write_statsfile(bool permanent)
(void) rc; /* we'll check for error with ferror */
/*
- * Walk through the database's access stats per table.
- */
- hash_seq_init(&tstat, dbentry->tables);
- while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
- {
- fputc('T', fpout);
- rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
- (void) rc; /* we'll check for error with ferror */
- }
-
- /*
- * Walk through the database's function stats table.
- */
- hash_seq_init(&fstat, dbentry->functions);
- while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
- {
- fputc('F', fpout);
- rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
- (void) rc; /* we'll check for error with ferror */
- }
-
- /*
* Mark the end of this DB
+ *
+ * TODO Does using these chars still make sense, when the tables/func
+ * stats are moved to a separate file?
*/
fputc('d', fpout);
}
@@ -3527,6 +3567,28 @@ pgstat_write_statsfile(bool permanent)
*/
fputc('E', fpout);
+ /*
+ * In any case we can just throw away all the db requests, but we need to
+ * write dummy files for databases without a stats entry (their absence
+ * would cause issues in pgstat_read_db_statsfile_timestamp and pgstat
+ * wait timeouts). This may happen e.g. for the shared DB (oid = 0)
+ * right after initdb.
+ */
+ if (last_statrequests != NULL)
+ {
+ int i = 0;
+ for (i = 0; i < num_statrequests; i++)
+ {
+ /* Create dummy files for requested databases without a proper
+ * dbentry. It's much easier this way than dealing with multiple
+ * timestamps, possibly existing but not yet written DBs etc. */
+ if (! pgstat_get_db_entry(last_statrequests[i].databaseid, false))
+ pgstat_write_db_dummyfile(last_statrequests[i].databaseid);
+ }
+
+ pfree(last_statrequests);
+ last_statrequests = NULL;
+ num_statrequests = 0;
+ }
+
if (ferror(fpout))
{
ereport(LOG,
@@ -3552,57 +3614,247 @@ pgstat_write_statsfile(bool permanent)
tmpfile, statfile)));
unlink(tmpfile);
}
- else
+
+ if (permanent)
+ unlink(pgstat_stat_filename);
+}
+
+
+/* ----------
+ * pgstat_write_db_statsfile() -
+ *
+ * Tell the news. This writes the stats file for a single database.
+ *
+ * If writing to the permanent file (happens when the collector is
+ * shutting down only), remove the temporary file so that backends
+ * starting up under a new postmaster can't read the old data before
+ * the new collector is ready.
+ * ----------
+ */
+static void
+pgstat_write_db_statsfile(PgStat_StatDBEntry * dbentry, bool permanent)
+{
+ HASH_SEQ_STATUS tstat;
+ HASH_SEQ_STATUS fstat;
+ PgStat_StatTabEntry *tabentry;
+ PgStat_StatFuncEntry *funcentry;
+ FILE *fpout;
+ int32 format_id;
+ const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_DB_TMPFILE : pgstat_stat_db_tmpname;
+ const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_DB_FILENAME : pgstat_stat_db_filename;
+ int rc;
+
+ /*
+ * OIDs are 32-bit values, so 10 chars should be safe, +1 for the \0 byte
+ */
+ char db_tmpfile[strlen(tmpfile) + 11];
+ char db_statfile[strlen(statfile) + 11];
+
+ /*
+ * Append database OID at the end of the basic filename (both for tmp and target file).
+ */
+ snprintf(db_tmpfile, strlen(tmpfile) + 11, tmpfile, dbentry->databaseid);
+ snprintf(db_statfile, strlen(statfile) + 11, statfile, dbentry->databaseid);
+
+ elog(DEBUG1, "writing statsfile '%s'", db_statfile);
+
+ /*
+ * Open the statistics temp file to write out the current values.
+ */
+ fpout = AllocateFile(db_tmpfile, PG_BINARY_W);
+ if (fpout == NULL)
{
- /*
- * Successful write, so update last_statwrite.
- */
- last_statwrite = globalStats.stats_timestamp;
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not open temporary statistics file \"%s\": %m",
+ db_tmpfile)));
+ return;
+ }
- /*
- * If there is clock skew between backends and the collector, we could
- * receive a stats request time that's in the future. If so, complain
- * and reset last_statrequest. Resetting ensures that no inquiry
- * message can cause more than one stats file write to occur.
- */
- if (last_statrequest > last_statwrite)
- {
- char *reqtime;
- char *mytime;
-
- /* Copy because timestamptz_to_str returns a static buffer */
- reqtime = pstrdup(timestamptz_to_str(last_statrequest));
- mytime = pstrdup(timestamptz_to_str(last_statwrite));
- elog(LOG, "last_statrequest %s is later than collector's time %s",
- reqtime, mytime);
- pfree(reqtime);
- pfree(mytime);
-
- last_statrequest = last_statwrite;
- }
+ /*
+ * Write the file header --- currently just a format ID.
+ */
+ format_id = PGSTAT_FILE_FORMAT_ID;
+ rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
+ (void) rc; /* we'll check for error with ferror */
+
+ /*
+ * Write the timestamp.
+ */
+ rc = fwrite(&(globalStats.stats_timestamp), sizeof(globalStats.stats_timestamp), 1, fpout);
+ (void) rc; /* we'll check for error with ferror */
+
+ /*
+ * Walk through the database's access stats per table.
+ */
+ hash_seq_init(&tstat, dbentry->tables);
+ while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
+ {
+ fputc('T', fpout);
+ rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
+ (void) rc; /* we'll check for error with ferror */
}
+ /*
+ * Walk through the database's function stats table.
+ */
+ hash_seq_init(&fstat, dbentry->functions);
+ while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
+ {
+ fputc('F', fpout);
+ rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
+ (void) rc; /* we'll check for error with ferror */
+ }
+
+ /*
+ * No more output to be done. Close the temp file and replace the old
+ * pgstat.stat with it. The ferror() check replaces testing for error
+ * after each individual fputc or fwrite above.
+ */
+ fputc('E', fpout);
+
+ if (ferror(fpout))
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not write temporary statistics file \"%s\": %m",
+ db_tmpfile)));
+ FreeFile(fpout);
+ unlink(db_tmpfile);
+ }
+ else if (FreeFile(fpout) < 0)
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not close temporary statistics file \"%s\": %m",
+ db_tmpfile)));
+ unlink(db_tmpfile);
+ }
+ else if (rename(db_tmpfile, db_statfile) < 0)
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
+ db_tmpfile, db_statfile)));
+ unlink(db_tmpfile);
+ }
+
if (permanent)
- unlink(pgstat_stat_filename);
+ {
+ char db_statfile[strlen(pgstat_stat_db_filename) + 11];
+ snprintf(db_statfile, strlen(pgstat_stat_db_filename) + 11,
+ pgstat_stat_db_filename, dbentry->databaseid);
+ elog(DEBUG1, "removing temporary stat file '%s'", db_statfile);
+ unlink(db_statfile);
+ }
}
/* ----------
+ * pgstat_write_db_dummyfile() -
+ *
+ * All this does is write a dummy stats file for databases without a
+ * dbentry yet. It writes just the file header - a format ID and a timestamp.
+ * ----------
+ */
+static void
+pgstat_write_db_dummyfile(Oid databaseid)
+{
+ FILE *fpout;
+ int32 format_id;
+ int rc;
+
+ /*
+ * OIDs are 32-bit values, so 10 chars should be safe, +1 for the \0 byte
+ */
+ char db_tmpfile[strlen(pgstat_stat_db_tmpname) + 11];
+ char db_statfile[strlen(pgstat_stat_db_filename) + 11];
+
+ /*
+ * Append database OID at the end of the basic filename (both for tmp and target file).
+ */
+ snprintf(db_tmpfile, strlen(pgstat_stat_db_tmpname) + 11, pgstat_stat_db_tmpname, databaseid);
+ snprintf(db_statfile, strlen(pgstat_stat_db_filename) + 11, pgstat_stat_db_filename, databaseid);
+
+ elog(DEBUG1, "writing statsfile '%s'", db_statfile);
+
+ /*
+ * Open the statistics temp file to write out the current values.
+ */
+ fpout = AllocateFile(db_tmpfile, PG_BINARY_W);
+ if (fpout == NULL)
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not open temporary statistics file \"%s\": %m",
+ db_tmpfile)));
+ return;
+ }
+
+ /*
+ * Write the file header --- currently just a format ID.
+ */
+ format_id = PGSTAT_FILE_FORMAT_ID;
+ rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
+ (void) rc; /* we'll check for error with ferror */
+
+ /*
+ * Write the timestamp.
+ */
+ rc = fwrite(&(globalStats.stats_timestamp), sizeof(globalStats.stats_timestamp), 1, fpout);
+ (void) rc; /* we'll check for error with ferror */
+
+ /*
+ * No more output to be done. Close the temp file and replace the old
+ * pgstat.stat with it. The ferror() check replaces testing for error
+ * after each individual fputc or fwrite above.
+ */
+ fputc('E', fpout);
+
+ if (ferror(fpout))
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not write temporary dummy statistics file \"%s\": %m",
+ db_tmpfile)));
+ FreeFile(fpout);
+ unlink(db_tmpfile);
+ }
+ else if (FreeFile(fpout) < 0)
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not close temporary dummy statistics file \"%s\": %m",
+ db_tmpfile)));
+ unlink(db_tmpfile);
+ }
+ else if (rename(db_tmpfile, db_statfile) < 0)
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not rename temporary dummy statistics file \"%s\" to \"%s\": %m",
+ db_tmpfile, db_statfile)));
+ unlink(db_tmpfile);
+ }
+
+}
+
+/* ----------
* pgstat_read_statsfile() -
*
* Reads in an existing statistics collector file and initializes the
* databases' hash table (whose entries point to the tables' hash tables).
+ *
+ * Allows reading only the db-level stats, which is enough for many
+ * purposes (e.g. the autovacuum launcher). Pass onlydbs=true when that
+ * is sufficient.
* ----------
*/
static HTAB *
-pgstat_read_statsfile(Oid onlydb, bool permanent)
+pgstat_read_statsfile(Oid onlydb, bool permanent, bool onlydbs)
{
PgStat_StatDBEntry *dbentry;
PgStat_StatDBEntry dbbuf;
- PgStat_StatTabEntry *tabentry;
- PgStat_StatTabEntry tabbuf;
- PgStat_StatFuncEntry funcbuf;
- PgStat_StatFuncEntry *funcentry;
HASHCTL hash_ctl;
HTAB *dbhash;
HTAB *tabhash = NULL;
@@ -3613,6 +3865,11 @@ pgstat_read_statsfile(Oid onlydb, bool permanent)
const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
/*
+ * If we want a db-level stats only, we don't want a particular db.
+ */
+ Assert(!((onlydb != InvalidOid) && onlydbs));
+
+ /*
* The tables will live in pgStatLocalContext.
*/
pgstat_setup_memcxt();
@@ -3758,6 +4015,16 @@ pgstat_read_statsfile(Oid onlydb, bool permanent)
*/
tabhash = dbentry->tables;
funchash = dbentry->functions;
+
+ /*
+ * Read the data from the file for this database. If there was
+ * onlydb specified (!= InvalidOid), we would not get here because
+ * of a break above. So we don't need to recheck.
+ */
+ if (! onlydbs)
+ pgstat_read_db_statsfile(dbentry->databaseid, tabhash, funchash,
+ permanent);
+
break;
/*
@@ -3768,6 +4035,105 @@ pgstat_read_statsfile(Oid onlydb, bool permanent)
funchash = NULL;
break;
+ case 'E':
+ goto done;
+
+ default:
+ ereport(pgStatRunningInCollector ? LOG : WARNING,
+ (errmsg("corrupted statistics file \"%s\"",
+ statfile)));
+ goto done;
+ }
+ }
+
+done:
+ FreeFile(fpin);
+
+ if (permanent)
+ unlink(PGSTAT_STAT_PERMANENT_FILENAME);
+
+ return dbhash;
+}
+
+
+/* ----------
+ * pgstat_read_db_statsfile() -
+ *
+ * Reads in an existing statistics collector db file and initializes the
+ * tables and functions hash tables (for the database identified by Oid).
+ * ----------
+ */
+static void
+pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent)
+{
+ PgStat_StatTabEntry *tabentry;
+ PgStat_StatTabEntry tabbuf;
+ PgStat_StatFuncEntry funcbuf;
+ PgStat_StatFuncEntry *funcentry;
+ FILE *fpin;
+ int32 format_id;
+ TimestampTz timestamp;
+ bool found;
+ const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_DB_FILENAME : pgstat_stat_db_filename;
+
+ /*
+ * OIDs are 32-bit values, so 10 chars should be safe, +1 for the \0 byte
+ */
+ char db_statfile[strlen(statfile) + 11];
+
+ /*
+ * Append database OID at the end of the basic filename (both for tmp and target file).
+ */
+ snprintf(db_statfile, strlen(statfile) + 11, statfile, databaseid);
+
+ /*
+ * Try to open the status file. If it doesn't exist, the backends simply
+ * return zero for anything and the collector simply starts from scratch
+ * with empty counters.
+ *
+ * ENOENT is a possibility if the stats collector is not running or has
+ * not yet written the stats file the first time. Any other failure
+ * condition is suspicious.
+ */
+ if ((fpin = AllocateFile(db_statfile, PG_BINARY_R)) == NULL)
+ {
+ if (errno != ENOENT)
+ ereport(pgStatRunningInCollector ? LOG : WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not open statistics file \"%s\": %m",
+ db_statfile)));
+ return;
+ }
+
+ /*
+ * Verify it's of the expected format.
+ */
+ if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id)
+ || format_id != PGSTAT_FILE_FORMAT_ID)
+ {
+ ereport(pgStatRunningInCollector ? LOG : WARNING,
+ (errmsg("corrupted statistics file \"%s\"", db_statfile)));
+ goto done;
+ }
+
+ /*
+ * Read global stats struct
+ */
+ if (fread(&timestamp, 1, sizeof(timestamp), fpin) != sizeof(timestamp))
+ {
+ ereport(pgStatRunningInCollector ? LOG : WARNING,
+ (errmsg("corrupted statistics file \"%s\"", db_statfile)));
+ goto done;
+ }
+
+ /*
+ * We found an existing collector stats file. Read it and put all the
+ * hashtable entries into place.
+ */
+ for (;;)
+ {
+ switch (fgetc(fpin))
+ {
/*
* 'T' A PgStat_StatTabEntry follows.
*/
@@ -3777,7 +4143,7 @@ pgstat_read_statsfile(Oid onlydb, bool permanent)
{
ereport(pgStatRunningInCollector ? LOG : WARNING,
(errmsg("corrupted statistics file \"%s\"",
- statfile)));
+ db_statfile)));
goto done;
}
@@ -3795,7 +4161,7 @@ pgstat_read_statsfile(Oid onlydb, bool permanent)
{
ereport(pgStatRunningInCollector ? LOG : WARNING,
(errmsg("corrupted statistics file \"%s\"",
- statfile)));
+ db_statfile)));
goto done;
}
@@ -3811,7 +4177,7 @@ pgstat_read_statsfile(Oid onlydb, bool permanent)
{
ereport(pgStatRunningInCollector ? LOG : WARNING,
(errmsg("corrupted statistics file \"%s\"",
- statfile)));
+ db_statfile)));
goto done;
}
@@ -3829,7 +4195,7 @@ pgstat_read_statsfile(Oid onlydb, bool permanent)
{
ereport(pgStatRunningInCollector ? LOG : WARNING,
(errmsg("corrupted statistics file \"%s\"",
- statfile)));
+ db_statfile)));
goto done;
}
@@ -3845,7 +4211,7 @@ pgstat_read_statsfile(Oid onlydb, bool permanent)
default:
ereport(pgStatRunningInCollector ? LOG : WARNING,
(errmsg("corrupted statistics file \"%s\"",
- statfile)));
+ db_statfile)));
goto done;
}
}
@@ -3854,37 +4220,47 @@ done:
FreeFile(fpin);
if (permanent)
- unlink(PGSTAT_STAT_PERMANENT_FILENAME);
+ {
+ char db_statfile[strlen(PGSTAT_STAT_PERMANENT_DB_FILENAME) + 11];
+ snprintf(db_statfile, strlen(PGSTAT_STAT_PERMANENT_DB_FILENAME) + 11,
+ PGSTAT_STAT_PERMANENT_DB_FILENAME, databaseid);
+ elog(DEBUG1, "removing permanent stats file '%s'", db_statfile);
+ unlink(db_statfile);
+ }
- return dbhash;
+ return;
}
/* ----------
- * pgstat_read_statsfile_timestamp() -
+ * pgstat_read_db_statsfile_timestamp() -
*
- * Attempt to fetch the timestamp of an existing stats file.
+ * Attempt to fetch the timestamp of an existing stats file (for a DB).
* Returns TRUE if successful (timestamp is stored at *ts).
* ----------
*/
static bool
-pgstat_read_statsfile_timestamp(bool permanent, TimestampTz *ts)
+pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent, TimestampTz *ts)
{
- PgStat_GlobalStats myGlobalStats;
+ TimestampTz timestamp;
FILE *fpin;
int32 format_id;
- const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+ const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_DB_FILENAME : pgstat_stat_db_filename;
+ char db_statfile[strlen(statfile) + 11];
+
+ /* format the db statfile filename */
+ snprintf(db_statfile, strlen(statfile) + 11, statfile, databaseid);
/*
* Try to open the status file. As above, anything but ENOENT is worthy
* of complaining about.
*/
- if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
+ if ((fpin = AllocateFile(db_statfile, PG_BINARY_R)) == NULL)
{
if (errno != ENOENT)
ereport(pgStatRunningInCollector ? LOG : WARNING,
(errcode_for_file_access(),
errmsg("could not open statistics file \"%s\": %m",
- statfile)));
+ db_statfile)));
return false;
}
@@ -3895,7 +4271,7 @@ pgstat_read_statsfile_timestamp(bool permanent, TimestampTz *ts)
|| format_id != PGSTAT_FILE_FORMAT_ID)
{
ereport(pgStatRunningInCollector ? LOG : WARNING,
- (errmsg("corrupted statistics file \"%s\"", statfile)));
+ (errmsg("corrupted statistics file \"%s\"", db_statfile)));
FreeFile(fpin);
return false;
}
@@ -3903,15 +4279,15 @@ pgstat_read_statsfile_timestamp(bool permanent, TimestampTz *ts)
/*
* Read global stats struct
*/
- if (fread(&myGlobalStats, 1, sizeof(myGlobalStats), fpin) != sizeof(myGlobalStats))
+ if (fread(&timestamp, 1, sizeof(TimestampTz), fpin) != sizeof(TimestampTz))
{
ereport(pgStatRunningInCollector ? LOG : WARNING,
- (errmsg("corrupted statistics file \"%s\"", statfile)));
+ (errmsg("corrupted statistics file \"%s\"", db_statfile)));
FreeFile(fpin);
return false;
}
- *ts = myGlobalStats.stats_timestamp;
+ *ts = timestamp;
FreeFile(fpin);
return true;
@@ -3947,7 +4323,7 @@ backend_read_statsfile(void)
CHECK_FOR_INTERRUPTS();
- ok = pgstat_read_statsfile_timestamp(false, &file_ts);
+ ok = pgstat_read_db_statsfile_timestamp(MyDatabaseId, false, &file_ts);
cur_ts = GetCurrentTimestamp();
/* Calculate min acceptable timestamp, if we didn't already */
@@ -4006,7 +4382,7 @@ backend_read_statsfile(void)
pfree(mytime);
}
- pgstat_send_inquiry(cur_ts, min_ts);
+ pgstat_send_inquiry(cur_ts, min_ts, MyDatabaseId);
break;
}
@@ -4016,7 +4392,7 @@ backend_read_statsfile(void)
/* Not there or too old, so kick the collector and wait a bit */
if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
- pgstat_send_inquiry(cur_ts, min_ts);
+ pgstat_send_inquiry(cur_ts, min_ts, MyDatabaseId);
pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
}
@@ -4026,9 +4402,16 @@ backend_read_statsfile(void)
/* Autovacuum launcher wants stats about all databases */
if (IsAutoVacuumLauncherProcess())
- pgStatDBHash = pgstat_read_statsfile(InvalidOid, false);
+ /*
+ * FIXME Does it really need info including tables/functions? Or is it enough to read
+ * database-level stats? It seems to me the launcher needs PgStat_StatDBEntry only
+ * (at least that's how I understand the rebuild_database_list() in autovacuum.c),
+ * because pgstat_stattabentries are used in do_autovacuum() only, and that's what's
+ * executed in workers ... So maybe we'd be just fine reading in the dbentries?
+ */
+ pgStatDBHash = pgstat_read_statsfile(InvalidOid, false, true);
else
- pgStatDBHash = pgstat_read_statsfile(MyDatabaseId, false);
+ pgStatDBHash = pgstat_read_statsfile(MyDatabaseId, false, false);
}
@@ -4084,44 +4467,84 @@ pgstat_clear_snapshot(void)
static void
pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
{
- /*
- * Advance last_statrequest if this requestor has a newer cutoff time
- * than any previous request.
- */
- if (msg->cutoff_time > last_statrequest)
- last_statrequest = msg->cutoff_time;
+ int i = 0;
+ bool found = false;
+ PgStat_StatDBEntry *dbentry;
+
+ elog(DEBUG1, "received inquiry for %d", msg->databaseid);
/*
- * If the requestor's local clock time is older than last_statwrite, we
- * should suspect a clock glitch, ie system time going backwards; though
- * the more likely explanation is just delayed message receipt. It is
- * worth expending a GetCurrentTimestamp call to be sure, since a large
- * retreat in the system clock reading could otherwise cause us to neglect
- * to update the stats file for a long time.
+ * Find the last write request for this DB (found=true in that case). Plain
+ * linear search, not really worth doing any magic here (probably).
*/
- if (msg->clock_time < last_statwrite)
+ for (i = 0; i < num_statrequests; i++)
+ {
+ if (last_statrequests[i].databaseid == msg->databaseid)
+ {
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ /*
+ * There already is a request for this DB, so let's advance the
+ * request time if this requestor has a newer cutoff time
+ * than any previous request.
+ */
+ if (msg->cutoff_time > last_statrequests[i].request_time)
+ last_statrequests[i].request_time = msg->cutoff_time;
+ }
+ else
{
- TimestampTz cur_ts = GetCurrentTimestamp();
+ /*
+ * There's no request for this DB yet, so let's create it (allocate
+ * space for it, set the values).
+ */
+ if (last_statrequests == NULL)
+ last_statrequests = palloc(sizeof(DBWriteRequest));
+ else
+ last_statrequests = repalloc(last_statrequests,
+ (num_statrequests + 1)*sizeof(DBWriteRequest));
+
+ last_statrequests[num_statrequests].databaseid = msg->databaseid;
+ last_statrequests[num_statrequests].request_time = msg->clock_time;
+ num_statrequests += 1;
- if (cur_ts < last_statwrite)
+ /*
+ * If the requestor's local clock time is older than last_statwrite, we
+ * should suspect a clock glitch, ie system time going backwards; though
+ * the more likely explanation is just delayed message receipt. It is
+ * worth expending a GetCurrentTimestamp call to be sure, since a large
+ * retreat in the system clock reading could otherwise cause us to neglect
+ * to update the stats file for a long time.
+ */
+ dbentry = pgstat_get_db_entry(msg->databaseid, false);
+ if ((dbentry != NULL) && (msg->clock_time < dbentry->stats_timestamp))
{
- /*
- * Sure enough, time went backwards. Force a new stats file write
- * to get back in sync; but first, log a complaint.
- */
- char *writetime;
- char *mytime;
-
- /* Copy because timestamptz_to_str returns a static buffer */
- writetime = pstrdup(timestamptz_to_str(last_statwrite));
- mytime = pstrdup(timestamptz_to_str(cur_ts));
- elog(LOG, "last_statwrite %s is later than collector's time %s",
- writetime, mytime);
- pfree(writetime);
- pfree(mytime);
-
- last_statrequest = cur_ts;
- last_statwrite = last_statrequest - 1;
+ TimestampTz cur_ts = GetCurrentTimestamp();
+
+ if (cur_ts < dbentry->stats_timestamp)
+ {
+ /*
+ * Sure enough, time went backwards. Force a new stats file write
+ * to get back in sync; but first, log a complaint.
+ */
+ char *writetime;
+ char *mytime;
+
+ /* Copy because timestamptz_to_str returns a static buffer */
+ writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
+ mytime = pstrdup(timestamptz_to_str(cur_ts));
+ elog(LOG, "last_statwrite %s is later than collector's time %s for "
+ "db %d", writetime, mytime, dbentry->databaseid);
+ pfree(writetime);
+ pfree(mytime);
+
+ last_statrequests[num_statrequests].request_time = cur_ts;
+ dbentry->stats_timestamp = cur_ts - 1;
+ }
}
}
}
@@ -4278,10 +4701,17 @@ pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
/*
- * If found, remove it.
+ * If found, remove it (along with the db statfile).
*/
if (dbentry)
{
+ char db_statfile[strlen(pgstat_stat_db_filename) + 11];
+ snprintf(db_statfile, strlen(pgstat_stat_db_filename) + 11,
+ pgstat_stat_db_filename, dbentry->databaseid);
+
+ elog(DEBUG1, "removing %s", db_statfile);
+ unlink(db_statfile);
+
if (dbentry->tables != NULL)
hash_destroy(dbentry->tables);
if (dbentry->functions != NULL)
@@ -4687,3 +5117,58 @@ pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
HASH_REMOVE, NULL);
}
}
+
+/* ----------
+ * pgstat_write_statsfile_needed() -
+ *
+ * Checks whether there's a db stats request, requiring a file write.
+ *
+ * TODO Seems that, thanks to the way we handle last_statrequests (erase after
+ * a write), this is unnecessary. Just check that there's at least one
+ * request and you're done. Although there might be delayed requests ...
+ * ----------
+ */
+
+static bool pgstat_write_statsfile_needed()
+{
+ int i = 0;
+ PgStat_StatDBEntry *dbentry;
+
+ /* Check the databases if they need to refresh the stats. */
+ for (i = 0; i < num_statrequests; i++)
+ {
+ dbentry = pgstat_get_db_entry(last_statrequests[i].databaseid, false);
+
+ /* No dbentry yet or too old. */
+ if ((! dbentry) ||
+ (dbentry->stats_timestamp < last_statrequests[i].request_time)) {
+ return true;
+ }
+
+ }
+
+ /* Well, everything was written recently ... */
+ return false;
+}
+
+/* ----------
+ * pgstat_db_requested() -
+ *
+ * Checks whether stats for a particular DB need to be written to a file.
+ * ----------
+ */
+
+static bool
+pgstat_db_requested(Oid databaseid)
+{
+ int i = 0;
+
+ /* Check the databases if they need to refresh the stats. */
+ for (i = 0; i < num_statrequests; i++)
+ {
+ if (last_statrequests[i].databaseid == databaseid)
+ return true;
+ }
+
+ return false;
+}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 2cf34ce..e3e432b 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -8730,20 +8730,43 @@ static void
assign_pgstat_temp_directory(const char *newval, void *extra)
{
/* check_canonical_path already canonicalized newval for us */
+ char *dname;
char *tname;
char *fname;
-
- tname = guc_malloc(ERROR, strlen(newval) + 12); /* /pgstat.tmp */
- sprintf(tname, "%s/pgstat.tmp", newval);
- fname = guc_malloc(ERROR, strlen(newval) + 13); /* /pgstat.stat */
- sprintf(fname, "%s/pgstat.stat", newval);
-
+ char *tname_db;
+ char *fname_db;
+
+ /* directory */
+ dname = guc_malloc(ERROR, strlen(newval) + 1); /* runtime dir */
+ sprintf(dname, "%s", newval);
+
+ /* global stats */
+ tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
+ sprintf(tname, "%s/global.tmp", newval);
+ fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
+ sprintf(fname, "%s/global.stat", newval);
+
+ /* per-db stats */
+ tname_db = guc_malloc(ERROR, strlen(newval) + 8); /* /%d.tmp */
+ sprintf(tname_db, "%s/%%d.tmp", newval);
+ fname_db = guc_malloc(ERROR, strlen(newval) + 9); /* /%d.stat */
+ sprintf(fname_db, "%s/%%d.stat", newval);
+
+ if (pgstat_stat_directory)
+ free(pgstat_stat_directory);
+ pgstat_stat_directory = dname;
if (pgstat_stat_tmpname)
free(pgstat_stat_tmpname);
pgstat_stat_tmpname = tname;
if (pgstat_stat_filename)
free(pgstat_stat_filename);
pgstat_stat_filename = fname;
+ if (pgstat_stat_db_tmpname)
+ free(pgstat_stat_db_tmpname);
+ pgstat_stat_db_tmpname = tname_db;
+ if (pgstat_stat_db_filename)
+ free(pgstat_stat_db_filename);
+ pgstat_stat_db_filename = fname_db;
}
static bool
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 3e05ac3..a8a2639 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -179,6 +179,7 @@ char *restrict_env;
#endif
const char *subdirs[] = {
"global",
+ "stat",
"pg_xlog",
"pg_xlog/archive_status",
"pg_clog",
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 613c1c2..b3467d2 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -205,6 +205,7 @@ typedef struct PgStat_MsgInquiry
PgStat_MsgHdr m_hdr;
TimestampTz clock_time; /* observed local clock time */
TimestampTz cutoff_time; /* minimum acceptable file timestamp */
+ Oid databaseid; /* requested DB (InvalidOid => all DBs) */
} PgStat_MsgInquiry;
@@ -514,7 +515,7 @@ typedef union PgStat_Msg
* ------------------------------------------------------------
*/
-#define PGSTAT_FILE_FORMAT_ID 0x01A5BC9A
+#define PGSTAT_FILE_FORMAT_ID 0xA240CA47
/* ----------
* PgStat_StatDBEntry The collector's data per database
@@ -545,6 +546,7 @@ typedef struct PgStat_StatDBEntry
PgStat_Counter n_block_write_time;
TimestampTz stat_reset_timestamp;
+ TimestampTz stats_timestamp; /* time of db stats file update */
/*
* tables and functions must be last in the struct, because we don't write
@@ -722,8 +724,11 @@ extern bool pgstat_track_activities;
extern bool pgstat_track_counts;
extern int pgstat_track_functions;
extern PGDLLIMPORT int pgstat_track_activity_query_size;
+extern char *pgstat_stat_directory;
extern char *pgstat_stat_tmpname;
extern char *pgstat_stat_filename;
+extern char *pgstat_stat_db_tmpname;
+extern char *pgstat_stat_db_filename;
/*
* BgWriter statistics counters are updated directly by bgwriter and bufmgr
Tomas Vondra wrote:
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index be3adf1..4ec485e 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -64,10 +64,14 @@
/* ----------
* Paths for the statistics files (relative to installation's $PGDATA).
+ * Permanent and temprorary, global and per-database files.
Note typo in the line above.
-#define PGSTAT_STAT_PERMANENT_FILENAME "global/pgstat.stat"
-#define PGSTAT_STAT_PERMANENT_TMPFILE "global/pgstat.tmp"
+#define PGSTAT_STAT_PERMANENT_DIRECTORY "stat"
+#define PGSTAT_STAT_PERMANENT_FILENAME "stat/global.stat"
+#define PGSTAT_STAT_PERMANENT_TMPFILE "stat/global.tmp"
+#define PGSTAT_STAT_PERMANENT_DB_FILENAME "stat/%d.stat"
+#define PGSTAT_STAT_PERMANENT_DB_TMPFILE "stat/%d.tmp"
+char *pgstat_stat_directory = NULL;
char *pgstat_stat_filename = NULL;
char *pgstat_stat_tmpname = NULL;
+char *pgstat_stat_db_filename = NULL;
+char *pgstat_stat_db_tmpname = NULL;
I don't like the quoted parts very much; it seems awkward to have the
snprintf patterns in one place and have them be used in very distant
places. Is there a way to improve that? Also, if I understand clearly,
the pgstat_stat_db_filename value needs to be an snprintf pattern too,
right? What if it doesn't contain the required % specifier?
Also, if you can filter this through pgindent, that would be best. Make
sure to add DBWriteRequest to src/tools/pgindent/typedefs_list.
+ /*
+ * There's no request for this DB yet, so lets create it (allocate a
+ * space for it, set the values).
+ */
+ if (last_statrequests == NULL)
+ last_statrequests = palloc(sizeof(DBWriteRequest));
+ else
+ last_statrequests = repalloc(last_statrequests,
+ (num_statrequests + 1)*sizeof(DBWriteRequest));
+
+ last_statrequests[num_statrequests].databaseid = msg->databaseid;
+ last_statrequests[num_statrequests].request_time = msg->clock_time;
+ num_statrequests += 1;
Having to repalloc this array each time seems wrong. Would a list
instead of an array help? see ilist.c/h; I vote for a dlist because you
can easily delete elements from the middle of it, if required (I think
you'd need that.)
+ char db_statfile[strlen(pgstat_stat_db_filename) + 11];
+ snprintf(db_statfile, strlen(pgstat_stat_db_filename) + 11,
+ pgstat_stat_filename, dbentry->databaseid);
This pattern seems rather frequent. Can we use a macro or similar here?
Encapsulating the "11" better would be good. Magic numbers are evil.
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 613c1c2..b3467d2 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -205,6 +205,7 @@ typedef struct PgStat_MsgInquiry
PgStat_MsgHdr m_hdr;
TimestampTz clock_time; /* observed local clock time */
TimestampTz cutoff_time; /* minimum acceptable file timestamp */
+ Oid databaseid; /* requested DB (InvalidOid => all DBs) */
} PgStat_MsgInquiry;
Do we need to support the case that somebody requests stuff from the
"shared" DB? IIRC that's what InvalidOid means in pgstat ...
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Sat, Jan 5, 2013 at 8:03 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
On 3.1.2013 20:33, Magnus Hagander wrote:
Yeah, +1 for a separate directory not in global.
OK, I moved the files from "global/stat" to "stat".
This has a warning:
pgstat.c:5132: warning: 'pgstat_write_statsfile_needed' was used with
no prototype before its definition
I plan to do some performance testing, but that will take a while so I
wanted to post this before I get distracted.
Cheers,
Jeff
On Sat, Jan 5, 2013 at 8:03 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
On 3.1.2013 20:33, Magnus Hagander wrote:
Yeah, +1 for a separate directory not in global.
OK, I moved the files from "global/stat" to "stat".
Why "stat" rather than "pg_stat"?
The existence of "global" and "base" as exceptions already annoys me.
(Especially when I do a tar -xf in my home directory without
remembering the -C flag). Unless there is some unstated rule behind
what gets a pg_ and what doesn't, I think we should have the "pg_".
Cheers,
Jeff
On Sat, Feb 2, 2013 at 2:33 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Sat, Jan 5, 2013 at 8:03 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
On 3.1.2013 20:33, Magnus Hagander wrote:
Yeah, +1 for a separate directory not in global.
OK, I moved the files from "global/stat" to "stat".
This has a warning:
pgstat.c:5132: warning: 'pgstat_write_statsfile_needed' was used with
no prototype before its definition
I plan to do some performance testing, but that will take a while so I
wanted to post this before I get distracted.
Running "vacuumdb -a" on a cluster with 1000 db with 200 tables (x
serial primary key) in each, I get log messages like this:
last_statwrite 23682-06-18 22:36:52.960194-07 is later than
collector's time 2013-02-03 12:49:19.700629-08 for db 16387
Note the bizarre year in the first time stamp.
If it matters, I got this after shutting down the cluster, blowing
away $DATA/stat/*, then restarting it and invoking vacuumdb.
Cheers,
Jeff
On 3.2.2013 20:46, Jeff Janes wrote:
On Sat, Jan 5, 2013 at 8:03 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
On 3.1.2013 20:33, Magnus Hagander wrote:
Yeah, +1 for a separate directory not in global.
OK, I moved the files from "global/stat" to "stat".
Why "stat" rather than "pg_stat"?
The existence of "global" and "base" as exceptions already annoys me.
(Especially when I do a tar -xf in my home directory without
remembering the -C flag). Unless there is some unstated rule behind
what gets a pg_ and what doesn't, I think we should have the "pg_".
I don't think there's a clear naming rule. But I think your suggestion
makes perfect sense, especially because we have pg_stat_tmp directory.
So now we'd have pg_stat and pg_stat_tmp, which is quite elegant.
Tomas
On 2.2.2013 23:33, Jeff Janes wrote:
On Sat, Jan 5, 2013 at 8:03 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
On 3.1.2013 20:33, Magnus Hagander wrote:
Yeah, +1 for a separate directory not in global.
OK, I moved the files from "global/stat" to "stat".
This has a warning:
pgstat.c:5132: warning: 'pgstat_write_statsfile_needed' was used with
no prototype before its definition
I forgot to add "void" into the method prototype ... Thanks!
Tomas
On 3.2.2013 21:54, Jeff Janes wrote:
On Sat, Feb 2, 2013 at 2:33 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Sat, Jan 5, 2013 at 8:03 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
On 3.1.2013 20:33, Magnus Hagander wrote:
Yeah, +1 for a separate directory not in global.
OK, I moved the files from "global/stat" to "stat".
This has a warning:
pgstat.c:5132: warning: 'pgstat_write_statsfile_needed' was used with
no prototype before its definition
I plan to do some performance testing, but that will take a while so I
wanted to post this before I get distracted.
Running "vacuumdb -a" on a cluster with 1000 db with 200 tables (x
serial primary key) in each, I get log messages like this:
last_statwrite 23682-06-18 22:36:52.960194-07 is later than
collector's time 2013-02-03 12:49:19.700629-08 for db 16387
Note the bizarre year in the first time stamp.
If it matters, I got this after shutting down the cluster, blowing
away $DATA/stat/*, then restarting it and invoking vacuumdb.
I somehow expected that hash_search zeroes all the fields of a new
entry, but looking at pgstat_get_db_entry that obviously is not the
case. So stats_timestamp (which tracks timestamp of the last write for a
DB) was random - that's where the bizarre year values came from.
I've added a proper initialization (to 0), and now it works as expected.
Although the whole sequence of errors I was getting was this:
LOG: last_statwrite 11133-08-28 19:22:31.711744+02 is later than
collector's time 2013-02-04 00:54:21.113439+01 for db 19093
WARNING: pgstat wait timeout
LOG: last_statwrite 39681-12-23 18:48:48.9093+01 is later than
collector's time 2013-02-04 00:54:31.424681+01 for db 46494
FATAL: could not find block containing chunk 0x2af4a60
LOG: statistics collector process (PID 10063) exited with exit code 1
WARNING: pgstat wait timeout
WARNING: pgstat wait timeout
I'm not entirely sure where the FATAL came from, but it seems it was
somehow related to the issue - it was quite reproducible, although I
don't see how exactly this could happen. The relevant block of code
looks like this:
char *writetime;
char *mytime;
/* Copy because timestamptz_to_str returns a static buffer */
writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
mytime = pstrdup(timestamptz_to_str(cur_ts));
elog(LOG, "last_statwrite %s is later than collector's time %s for "
"db %d", writetime, mytime, dbentry->databaseid);
pfree(writetime);
pfree(mytime);
which seems quite fine to me. I'm not sure how one of the pfree calls
could fail?
Anyway, attached is a patch that fixes all three issues, i.e.
1) the un-initialized timestamp
2) the "void" omitted from the signature
3) rename to "pg_stat" instead of just "stat"
Tomas
Attachments:
stats-split-v6.patch (text/x-diff)
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index d318db9..6d0efe9 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -64,10 +64,14 @@
/* ----------
* Paths for the statistics files (relative to installation's $PGDATA).
+ * Permanent and temporary, global and per-database files.
* ----------
*/
-#define PGSTAT_STAT_PERMANENT_FILENAME "global/pgstat.stat"
-#define PGSTAT_STAT_PERMANENT_TMPFILE "global/pgstat.tmp"
+#define PGSTAT_STAT_PERMANENT_DIRECTORY "pg_stat"
+#define PGSTAT_STAT_PERMANENT_FILENAME "pg_stat/global.stat"
+#define PGSTAT_STAT_PERMANENT_TMPFILE "pg_stat/global.tmp"
+#define PGSTAT_STAT_PERMANENT_DB_FILENAME "pg_stat/%d.stat"
+#define PGSTAT_STAT_PERMANENT_DB_TMPFILE "pg_stat/%d.tmp"
/* ----------
* Timer definitions.
@@ -115,8 +119,11 @@ int pgstat_track_activity_query_size = 1024;
* Built from GUC parameter
* ----------
*/
+char *pgstat_stat_directory = NULL;
char *pgstat_stat_filename = NULL;
char *pgstat_stat_tmpname = NULL;
+char *pgstat_stat_db_filename = NULL;
+char *pgstat_stat_db_tmpname = NULL;
/*
* BgWriter global statistics counters (unused in other processes).
@@ -219,11 +226,16 @@ static int localNumBackends = 0;
*/
static PgStat_GlobalStats globalStats;
-/* Last time the collector successfully wrote the stats file */
-static TimestampTz last_statwrite;
+/* Write request info for each database */
+typedef struct DBWriteRequest
+{
+ Oid databaseid; /* OID of the database to write */
+ TimestampTz request_time; /* timestamp of the last write request */
+} DBWriteRequest;
-/* Latest statistics request time from backends */
-static TimestampTz last_statrequest;
+/* Latest statistics request time from backends for each DB */
+static DBWriteRequest * last_statrequests = NULL;
+static int num_statrequests = 0;
static volatile bool need_exit = false;
static volatile bool got_SIGHUP = false;
@@ -252,11 +264,17 @@ static void pgstat_sighup_handler(SIGNAL_ARGS);
static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
Oid tableoid, bool create);
-static void pgstat_write_statsfile(bool permanent);
-static HTAB *pgstat_read_statsfile(Oid onlydb, bool permanent);
+static void pgstat_write_statsfile(bool permanent, bool force);
+static void pgstat_write_db_statsfile(PgStat_StatDBEntry * dbentry, bool permanent);
+static void pgstat_write_db_dummyfile(Oid databaseid);
+static HTAB *pgstat_read_statsfile(Oid onlydb, bool permanent, bool onlydbs);
+static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
static void backend_read_statsfile(void);
static void pgstat_read_current_status(void);
+static bool pgstat_write_statsfile_needed(void);
+static bool pgstat_db_requested(Oid databaseid);
+
static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
static void pgstat_send_funcstats(void);
static HTAB *pgstat_collect_oids(Oid catalogid);
@@ -285,7 +303,6 @@ static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int le
static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
-
/* ------------------------------------------------------------
* Public functions called from postmaster follow
* ------------------------------------------------------------
@@ -549,8 +566,34 @@ startup_failed:
void
pgstat_reset_all(void)
{
- unlink(pgstat_stat_filename);
- unlink(PGSTAT_STAT_PERMANENT_FILENAME);
+ DIR * dir;
+ struct dirent * entry;
+
+ dir = AllocateDir(pgstat_stat_directory);
+ while ((entry = ReadDir(dir, pgstat_stat_directory)) != NULL)
+ {
+ char fname[strlen(pgstat_stat_directory) + strlen(entry->d_name) + 2]; /* dir + '/' + name + '\0' */
+
+ if (strcmp(entry->d_name, ".") == 0 || strcmp(entry->d_name, "..") == 0)
+ continue;
+
+ sprintf(fname, "%s/%s", pgstat_stat_directory, entry->d_name);
+ unlink(fname);
+ }
+ FreeDir(dir);
+
+ dir = AllocateDir(PGSTAT_STAT_PERMANENT_DIRECTORY);
+ while ((entry = ReadDir(dir, PGSTAT_STAT_PERMANENT_DIRECTORY)) != NULL)
+ {
+ char fname[strlen(PGSTAT_STAT_PERMANENT_DIRECTORY) + strlen(entry->d_name) + 2];
+
+ if (strcmp(entry->d_name, ".") == 0 || strcmp(entry->d_name, "..") == 0)
+ continue;
+
+ sprintf(fname, "%s/%s", PGSTAT_STAT_PERMANENT_DIRECTORY, entry->d_name);
+ unlink(fname);
+ }
+ FreeDir(dir);
}
#ifdef EXEC_BACKEND
@@ -1408,13 +1451,14 @@ pgstat_ping(void)
* ----------
*/
static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time)
+pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
{
PgStat_MsgInquiry msg;
pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
msg.clock_time = clock_time;
msg.cutoff_time = cutoff_time;
+ msg.databaseid = databaseid;
pgstat_send(&msg, sizeof(msg));
}
@@ -3004,6 +3048,7 @@ PgstatCollectorMain(int argc, char *argv[])
int len;
PgStat_Msg msg;
int wr;
+ bool first_write = true;
IsUnderPostmaster = true; /* we are a postmaster subprocess now */
@@ -3053,17 +3098,11 @@ PgstatCollectorMain(int argc, char *argv[])
init_ps_display("stats collector process", "", "", "");
/*
- * Arrange to write the initial status file right away
- */
- last_statrequest = GetCurrentTimestamp();
- last_statwrite = last_statrequest - 1;
-
- /*
* Read in an existing statistics stats file or initialize the stats to
- * zero.
+ * zero (read data for all databases, including table/func stats).
*/
pgStatRunningInCollector = true;
- pgStatDBHash = pgstat_read_statsfile(InvalidOid, true);
+ pgStatDBHash = pgstat_read_statsfile(InvalidOid, true, false);
/*
* Loop to process messages until we get SIGQUIT or detect ungraceful
@@ -3107,10 +3146,14 @@ PgstatCollectorMain(int argc, char *argv[])
/*
* Write the stats file if a new request has arrived that is not
- * satisfied by existing file.
+ * satisfied by existing file (force writing all files if it's
+ * the first write after startup).
*/
- if (last_statwrite < last_statrequest)
- pgstat_write_statsfile(false);
+ if (first_write || pgstat_write_statsfile_needed())
+ {
+ pgstat_write_statsfile(false, first_write);
+ first_write = false;
+ }
/*
* Try to receive and process a message. This will not block,
@@ -3269,7 +3312,7 @@ PgstatCollectorMain(int argc, char *argv[])
/*
* Save the final stats to reuse at next startup.
*/
- pgstat_write_statsfile(true);
+ pgstat_write_statsfile(true, true);
exit(0);
}
@@ -3349,6 +3392,7 @@ pgstat_get_db_entry(Oid databaseid, bool create)
result->n_block_write_time = 0;
result->stat_reset_timestamp = GetCurrentTimestamp();
+ result->stats_timestamp = 0;
memset(&hash_ctl, 0, sizeof(hash_ctl));
hash_ctl.keysize = sizeof(Oid);
@@ -3429,23 +3473,25 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
* shutting down only), remove the temporary file so that backends
* starting up under a new postmaster can't read the old data before
* the new collector is ready.
+ *
+ * When 'force' is false, only the requested databases (listed in
+ * last_statrequests) will be written. If 'force' is true, all databases
+ * will be written (this is used e.g. at shutdown).
* ----------
*/
static void
-pgstat_write_statsfile(bool permanent)
+pgstat_write_statsfile(bool permanent, bool force)
{
HASH_SEQ_STATUS hstat;
- HASH_SEQ_STATUS tstat;
- HASH_SEQ_STATUS fstat;
PgStat_StatDBEntry *dbentry;
- PgStat_StatTabEntry *tabentry;
- PgStat_StatFuncEntry *funcentry;
FILE *fpout;
int32 format_id;
const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
int rc;
+ elog(DEBUG1, "writing statsfile '%s'", statfile);
+
/*
* Open the statistics temp file to write out the current values.
*/
@@ -3484,6 +3530,20 @@ pgstat_write_statsfile(bool permanent)
while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
{
/*
+ * Write out the tables and functions into a separate file, but only
+ * if the database is in the requests or if it's a forced write (then
+ * all the DBs need to be written - e.g. at the shutdown).
+ *
+ * We need to do this before the dbentry write to write the proper
+ * timestamp to the global file.
+ */
+ if (force || pgstat_db_requested(dbentry->databaseid)) {
+ elog(DEBUG1, "writing statsfile for DB %d", dbentry->databaseid);
+ dbentry->stats_timestamp = globalStats.stats_timestamp;
+ pgstat_write_db_statsfile(dbentry, permanent);
+ }
+
+ /*
* Write out the DB entry including the number of live backends. We
* don't write the tables or functions pointers, since they're of no
* use to any other process.
@@ -3493,29 +3553,10 @@ pgstat_write_statsfile(bool permanent)
(void) rc; /* we'll check for error with ferror */
/*
- * Walk through the database's access stats per table.
- */
- hash_seq_init(&tstat, dbentry->tables);
- while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
- {
- fputc('T', fpout);
- rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
- (void) rc; /* we'll check for error with ferror */
- }
-
- /*
- * Walk through the database's function stats table.
- */
- hash_seq_init(&fstat, dbentry->functions);
- while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
- {
- fputc('F', fpout);
- rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
- (void) rc; /* we'll check for error with ferror */
- }
-
- /*
* Mark the end of this DB
+ *
+ * TODO Does using these chars still make sense, when the tables/func
+ * stats are moved to a separate file?
*/
fputc('d', fpout);
}
@@ -3527,6 +3568,28 @@ pgstat_write_statsfile(bool permanent)
*/
fputc('E', fpout);
+ /* In any case, we can just throw away all the db requests, but we need to
+ * write dummy files for databases without a stat entry (it would cause
+ * issues in pgstat_read_db_statsfile_timestamp and pgstat wait timeouts).
+ * This may happen e.g. for the shared DB (oid = 0) right after initdb.
+ */
+ if (last_statrequests != NULL)
+ {
+ int i = 0;
+ for (i = 0; i < num_statrequests; i++)
+ {
+ /* Create dummy files for requested databases without a proper
+ * dbentry. It's much easier this way than dealing with multiple
+ * timestamps, possibly existing but not yet written DBs etc. */
+ if (! pgstat_get_db_entry(last_statrequests[i].databaseid, false))
+ pgstat_write_db_dummyfile(last_statrequests[i].databaseid);
+ }
+
+ pfree(last_statrequests);
+ last_statrequests = NULL;
+ num_statrequests = 0;
+ }
+
if (ferror(fpout))
{
ereport(LOG,
@@ -3552,57 +3615,247 @@ pgstat_write_statsfile(bool permanent)
tmpfile, statfile)));
unlink(tmpfile);
}
- else
+
+ if (permanent)
+ unlink(pgstat_stat_filename);
+}
+
+
+/* ----------
+ * pgstat_write_db_statsfile() -
+ *
+ * Tell the news. This writes the stats file for a single database.
+ *
+ * If writing to the permanent file (happens when the collector is
+ * shutting down only), remove the temporary file so that backends
+ * starting up under a new postmaster can't read the old data before
+ * the new collector is ready.
+ * ----------
+ */
+static void
+pgstat_write_db_statsfile(PgStat_StatDBEntry * dbentry, bool permanent)
+{
+ HASH_SEQ_STATUS tstat;
+ HASH_SEQ_STATUS fstat;
+ PgStat_StatTabEntry *tabentry;
+ PgStat_StatFuncEntry *funcentry;
+ FILE *fpout;
+ int32 format_id;
+ const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_DB_TMPFILE : pgstat_stat_db_tmpname;
+ const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_DB_FILENAME : pgstat_stat_db_filename;
+ int rc;
+
+ /*
+ * OIDs are 32-bit values, so 10 chars should be safe, +1 for the \0 byte
+ */
+ char db_tmpfile[strlen(tmpfile) + 11];
+ char db_statfile[strlen(statfile) + 11];
+
+ /*
+ * Append database OID at the end of the basic filename (both for tmp and target file).
+ */
+ snprintf(db_tmpfile, strlen(tmpfile) + 11, tmpfile, dbentry->databaseid);
+ snprintf(db_statfile, strlen(statfile) + 11, statfile, dbentry->databaseid);
+
+ elog(DEBUG1, "writing statsfile '%s'", db_statfile);
+
+ /*
+ * Open the statistics temp file to write out the current values.
+ */
+ fpout = AllocateFile(db_tmpfile, PG_BINARY_W);
+ if (fpout == NULL)
{
- /*
- * Successful write, so update last_statwrite.
- */
- last_statwrite = globalStats.stats_timestamp;
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not open temporary statistics file \"%s\": %m",
+ db_tmpfile)));
+ return;
+ }
- /*
- * If there is clock skew between backends and the collector, we could
- * receive a stats request time that's in the future. If so, complain
- * and reset last_statrequest. Resetting ensures that no inquiry
- * message can cause more than one stats file write to occur.
- */
- if (last_statrequest > last_statwrite)
- {
- char *reqtime;
- char *mytime;
-
- /* Copy because timestamptz_to_str returns a static buffer */
- reqtime = pstrdup(timestamptz_to_str(last_statrequest));
- mytime = pstrdup(timestamptz_to_str(last_statwrite));
- elog(LOG, "last_statrequest %s is later than collector's time %s",
- reqtime, mytime);
- pfree(reqtime);
- pfree(mytime);
-
- last_statrequest = last_statwrite;
- }
+ /*
+ * Write the file header --- currently just a format ID.
+ */
+ format_id = PGSTAT_FILE_FORMAT_ID;
+ rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
+ (void) rc; /* we'll check for error with ferror */
+
+ /*
+ * Write the timestamp.
+ */
+ rc = fwrite(&(globalStats.stats_timestamp), sizeof(globalStats.stats_timestamp), 1, fpout);
+ (void) rc; /* we'll check for error with ferror */
+
+ /*
+ * Walk through the database's access stats per table.
+ */
+ hash_seq_init(&tstat, dbentry->tables);
+ while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
+ {
+ fputc('T', fpout);
+ rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
+ (void) rc; /* we'll check for error with ferror */
}
+ /*
+ * Walk through the database's function stats table.
+ */
+ hash_seq_init(&fstat, dbentry->functions);
+ while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
+ {
+ fputc('F', fpout);
+ rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
+ (void) rc; /* we'll check for error with ferror */
+ }
+
+ /*
+ * No more output to be done. Close the temp file and replace the old
+ * pgstat.stat with it. The ferror() check replaces testing for error
+ * after each individual fputc or fwrite above.
+ */
+ fputc('E', fpout);
+
+ if (ferror(fpout))
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not write temporary statistics file \"%s\": %m",
+ db_tmpfile)));
+ FreeFile(fpout);
+ unlink(db_tmpfile);
+ }
+ else if (FreeFile(fpout) < 0)
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not close temporary statistics file \"%s\": %m",
+ db_tmpfile)));
+ unlink(db_tmpfile);
+ }
+ else if (rename(db_tmpfile, db_statfile) < 0)
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
+ db_tmpfile, db_statfile)));
+ unlink(db_tmpfile);
+ }
+
if (permanent)
- unlink(pgstat_stat_filename);
+ {
+ char db_statfile[strlen(pgstat_stat_db_filename) + 11];
+ snprintf(db_statfile, strlen(pgstat_stat_db_filename) + 11,
+ pgstat_stat_db_filename, dbentry->databaseid);
+ elog(DEBUG1, "removing temporary stat file '%s'", db_statfile);
+ unlink(db_statfile);
+ }
}
/* ----------
+ * pgstat_write_db_dummyfile() -
+ *
+ * All this does is write a dummy stats file for databases that have no
+ * dbentry yet. It writes just the file header - format ID and a timestamp.
+ * ----------
+ */
+static void
+pgstat_write_db_dummyfile(Oid databaseid)
+{
+ FILE *fpout;
+ int32 format_id;
+ int rc;
+
+ /*
+ * OIDs are 32-bit values, so 10 chars should be safe, +1 for the \0 byte
+ */
+ char db_tmpfile[strlen(pgstat_stat_db_tmpname) + 11];
+ char db_statfile[strlen(pgstat_stat_db_filename) + 11];
+
+ /*
+ * Append database OID at the end of the basic filename (both for tmp and target file).
+ */
+ snprintf(db_tmpfile, strlen(pgstat_stat_db_tmpname) + 11, pgstat_stat_db_tmpname, databaseid);
+ snprintf(db_statfile, strlen(pgstat_stat_db_filename) + 11, pgstat_stat_db_filename, databaseid);
+
+ elog(DEBUG1, "writing statsfile '%s'", db_statfile);
+
+ /*
+ * Open the statistics temp file to write out the current values.
+ */
+ fpout = AllocateFile(db_tmpfile, PG_BINARY_W);
+ if (fpout == NULL)
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not open temporary statistics file \"%s\": %m",
+ db_tmpfile)));
+ return;
+ }
+
+ /*
+ * Write the file header --- currently just a format ID.
+ */
+ format_id = PGSTAT_FILE_FORMAT_ID;
+ rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
+ (void) rc; /* we'll check for error with ferror */
+
+ /*
+ * Write the timestamp.
+ */
+ rc = fwrite(&(globalStats.stats_timestamp), sizeof(globalStats.stats_timestamp), 1, fpout);
+ (void) rc; /* we'll check for error with ferror */
+
+ /*
+ * No more output to be done. Close the temp file and replace the old
+ * pgstat.stat with it. The ferror() check replaces testing for error
+ * after each individual fputc or fwrite above.
+ */
+ fputc('E', fpout);
+
+ if (ferror(fpout))
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not write temporary dummy statistics file \"%s\": %m",
+ db_tmpfile)));
+ FreeFile(fpout);
+ unlink(db_tmpfile);
+ }
+ else if (FreeFile(fpout) < 0)
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not close temporary dummy statistics file \"%s\": %m",
+ db_tmpfile)));
+ unlink(db_tmpfile);
+ }
+ else if (rename(db_tmpfile, db_statfile) < 0)
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not rename temporary dummy statistics file \"%s\" to \"%s\": %m",
+ db_tmpfile, db_statfile)));
+ unlink(db_tmpfile);
+ }
+
+}
+
+/* ----------
* pgstat_read_statsfile() -
*
* Reads in an existing statistics collector file and initializes the
* databases' hash table (whose entries point to the tables' hash tables).
+ *
+ * Allows reading only the global stats (at database level), which is just
+ * enough for many purposes (e.g. autovacuum launcher etc.). If this is
+ * sufficient for you, use onlydbs=true.
* ----------
*/
static HTAB *
-pgstat_read_statsfile(Oid onlydb, bool permanent)
+pgstat_read_statsfile(Oid onlydb, bool permanent, bool onlydbs)
{
PgStat_StatDBEntry *dbentry;
PgStat_StatDBEntry dbbuf;
- PgStat_StatTabEntry *tabentry;
- PgStat_StatTabEntry tabbuf;
- PgStat_StatFuncEntry funcbuf;
- PgStat_StatFuncEntry *funcentry;
HASHCTL hash_ctl;
HTAB *dbhash;
HTAB *tabhash = NULL;
@@ -3613,6 +3866,11 @@ pgstat_read_statsfile(Oid onlydb, bool permanent)
const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
/*
+ * If we want db-level stats only, we can't ask for a particular db.
+ */
+ Assert(!((onlydb != InvalidOid) && onlydbs));
+
+ /*
* The tables will live in pgStatLocalContext.
*/
pgstat_setup_memcxt();
@@ -3758,6 +4016,16 @@ pgstat_read_statsfile(Oid onlydb, bool permanent)
*/
tabhash = dbentry->tables;
funchash = dbentry->functions;
+
+ /*
+ * Read the data from the file for this database. If there was
+ * onlydb specified (!= InvalidOid), we would not get here because
+ * of a break above. So we don't need to recheck.
+ */
+ if (! onlydbs)
+ pgstat_read_db_statsfile(dbentry->databaseid, tabhash, funchash,
+ permanent);
+
break;
/*
@@ -3768,6 +4036,105 @@ pgstat_read_statsfile(Oid onlydb, bool permanent)
funchash = NULL;
break;
+ case 'E':
+ goto done;
+
+ default:
+ ereport(pgStatRunningInCollector ? LOG : WARNING,
+ (errmsg("corrupted statistics file \"%s\"",
+ statfile)));
+ goto done;
+ }
+ }
+
+done:
+ FreeFile(fpin);
+
+ if (permanent)
+ unlink(PGSTAT_STAT_PERMANENT_FILENAME);
+
+ return dbhash;
+}
+
+
+/* ----------
+ * pgstat_read_db_statsfile() -
+ *
+ * Reads in an existing statistics collector db file and initializes the
+ * tables and functions hash tables (for the database identified by Oid).
+ * ----------
+ */
+static void
+pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent)
+{
+ PgStat_StatTabEntry *tabentry;
+ PgStat_StatTabEntry tabbuf;
+ PgStat_StatFuncEntry funcbuf;
+ PgStat_StatFuncEntry *funcentry;
+ FILE *fpin;
+ int32 format_id;
+ TimestampTz timestamp;
+ bool found;
+ const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_DB_FILENAME : pgstat_stat_db_filename;
+
+ /*
+ * OIDs are 32-bit values, so 10 chars should be safe, +1 for the \0 byte
+ */
+ char db_statfile[strlen(statfile) + 11];
+
+ /*
+ * Append the database OID at the end of the base filename.
+ */
+ snprintf(db_statfile, strlen(statfile) + 11, statfile, databaseid);
+
+ /*
+ * Try to open the status file. If it doesn't exist, the backends simply
+ * return zero for anything and the collector simply starts from scratch
+ * with empty counters.
+ *
+ * ENOENT is a possibility if the stats collector is not running or has
+ * not yet written the stats file the first time. Any other failure
+ * condition is suspicious.
+ */
+ if ((fpin = AllocateFile(db_statfile, PG_BINARY_R)) == NULL)
+ {
+ if (errno != ENOENT)
+ ereport(pgStatRunningInCollector ? LOG : WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not open statistics file \"%s\": %m",
+ db_statfile)));
+ return;
+ }
+
+ /*
+ * Verify it's of the expected format.
+ */
+ if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id)
+ || format_id != PGSTAT_FILE_FORMAT_ID)
+ {
+ ereport(pgStatRunningInCollector ? LOG : WARNING,
+ (errmsg("corrupted statistics file \"%s\"", db_statfile)));
+ goto done;
+ }
+
+ /*
+ * Read global stats struct
+ */
+ if (fread(&timestamp, 1, sizeof(timestamp), fpin) != sizeof(timestamp))
+ {
+ ereport(pgStatRunningInCollector ? LOG : WARNING,
+ (errmsg("corrupted statistics file \"%s\"", db_statfile)));
+ goto done;
+ }
+
+ /*
+ * We found an existing collector stats file. Read it and put all the
+ * hashtable entries into place.
+ */
+ for (;;)
+ {
+ switch (fgetc(fpin))
+ {
/*
* 'T' A PgStat_StatTabEntry follows.
*/
@@ -3777,7 +4144,7 @@ pgstat_read_statsfile(Oid onlydb, bool permanent)
{
ereport(pgStatRunningInCollector ? LOG : WARNING,
(errmsg("corrupted statistics file \"%s\"",
- statfile)));
+ db_statfile)));
goto done;
}
@@ -3795,7 +4162,7 @@ pgstat_read_statsfile(Oid onlydb, bool permanent)
{
ereport(pgStatRunningInCollector ? LOG : WARNING,
(errmsg("corrupted statistics file \"%s\"",
- statfile)));
+ db_statfile)));
goto done;
}
@@ -3811,7 +4178,7 @@ pgstat_read_statsfile(Oid onlydb, bool permanent)
{
ereport(pgStatRunningInCollector ? LOG : WARNING,
(errmsg("corrupted statistics file \"%s\"",
- statfile)));
+ db_statfile)));
goto done;
}
@@ -3829,7 +4196,7 @@ pgstat_read_statsfile(Oid onlydb, bool permanent)
{
ereport(pgStatRunningInCollector ? LOG : WARNING,
(errmsg("corrupted statistics file \"%s\"",
- statfile)));
+ db_statfile)));
goto done;
}
@@ -3845,7 +4212,7 @@ pgstat_read_statsfile(Oid onlydb, bool permanent)
default:
ereport(pgStatRunningInCollector ? LOG : WARNING,
(errmsg("corrupted statistics file \"%s\"",
- statfile)));
+ db_statfile)));
goto done;
}
}
@@ -3854,37 +4221,47 @@ done:
FreeFile(fpin);
if (permanent)
- unlink(PGSTAT_STAT_PERMANENT_FILENAME);
+ {
+ char db_statfile[strlen(PGSTAT_STAT_PERMANENT_DB_FILENAME) + 11];
+ snprintf(db_statfile, strlen(PGSTAT_STAT_PERMANENT_DB_FILENAME) + 11,
+ PGSTAT_STAT_PERMANENT_DB_FILENAME, databaseid);
+ elog(DEBUG1, "removing permanent stats file '%s'", db_statfile);
+ unlink(db_statfile);
+ }
- return dbhash;
+ return;
}
/* ----------
- * pgstat_read_statsfile_timestamp() -
+ * pgstat_read_db_statsfile_timestamp() -
*
- * Attempt to fetch the timestamp of an existing stats file.
+ * Attempt to fetch the timestamp of an existing stats file (for a DB).
* Returns TRUE if successful (timestamp is stored at *ts).
* ----------
*/
static bool
-pgstat_read_statsfile_timestamp(bool permanent, TimestampTz *ts)
+pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent, TimestampTz *ts)
{
- PgStat_GlobalStats myGlobalStats;
+ TimestampTz timestamp;
FILE *fpin;
int32 format_id;
- const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+ const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_DB_FILENAME : pgstat_stat_db_filename;
+ char db_statfile[strlen(statfile) + 11];
+
+ /* format the db statfile filename */
+ snprintf(db_statfile, strlen(statfile) + 11, statfile, databaseid);
/*
* Try to open the status file. As above, anything but ENOENT is worthy
* of complaining about.
*/
- if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
+ if ((fpin = AllocateFile(db_statfile, PG_BINARY_R)) == NULL)
{
if (errno != ENOENT)
ereport(pgStatRunningInCollector ? LOG : WARNING,
(errcode_for_file_access(),
errmsg("could not open statistics file \"%s\": %m",
- statfile)));
+ db_statfile)));
return false;
}
@@ -3895,7 +4272,7 @@ pgstat_read_statsfile_timestamp(bool permanent, TimestampTz *ts)
|| format_id != PGSTAT_FILE_FORMAT_ID)
{
ereport(pgStatRunningInCollector ? LOG : WARNING,
- (errmsg("corrupted statistics file \"%s\"", statfile)));
+ (errmsg("corrupted statistics file \"%s\"", db_statfile)));
FreeFile(fpin);
return false;
}
@@ -3903,15 +4280,15 @@ pgstat_read_statsfile_timestamp(bool permanent, TimestampTz *ts)
/*
* Read global stats struct
*/
- if (fread(&myGlobalStats, 1, sizeof(myGlobalStats), fpin) != sizeof(myGlobalStats))
+ if (fread(&timestamp, 1, sizeof(TimestampTz), fpin) != sizeof(TimestampTz))
{
ereport(pgStatRunningInCollector ? LOG : WARNING,
- (errmsg("corrupted statistics file \"%s\"", statfile)));
+ (errmsg("corrupted statistics file \"%s\"", db_statfile)));
FreeFile(fpin);
return false;
}
- *ts = myGlobalStats.stats_timestamp;
+ *ts = timestamp;
FreeFile(fpin);
return true;
@@ -3947,7 +4324,7 @@ backend_read_statsfile(void)
CHECK_FOR_INTERRUPTS();
- ok = pgstat_read_statsfile_timestamp(false, &file_ts);
+ ok = pgstat_read_db_statsfile_timestamp(MyDatabaseId, false, &file_ts);
cur_ts = GetCurrentTimestamp();
/* Calculate min acceptable timestamp, if we didn't already */
@@ -4006,7 +4383,7 @@ backend_read_statsfile(void)
pfree(mytime);
}
- pgstat_send_inquiry(cur_ts, min_ts);
+ pgstat_send_inquiry(cur_ts, min_ts, MyDatabaseId);
break;
}
@@ -4016,7 +4393,7 @@ backend_read_statsfile(void)
/* Not there or too old, so kick the collector and wait a bit */
if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
- pgstat_send_inquiry(cur_ts, min_ts);
+ pgstat_send_inquiry(cur_ts, min_ts, MyDatabaseId);
pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
}
@@ -4026,9 +4403,16 @@ backend_read_statsfile(void)
/* Autovacuum launcher wants stats about all databases */
if (IsAutoVacuumLauncherProcess())
- pgStatDBHash = pgstat_read_statsfile(InvalidOid, false);
+ /*
+ * FIXME Does it really need info including tables/functions? Or is it enough to read
+ * database-level stats? It seems to me the launcher needs PgStat_StatDBEntry only
+ * (at least that's how I understand the rebuild_database_list() in autovacuum.c),
+ * because pgstat_stattabentries are used in do_autovacuum() only, and that's
+ * what's executed in the workers ... So maybe we'd be just fine reading in
+ * the dbentries?
+ */
+ pgStatDBHash = pgstat_read_statsfile(InvalidOid, false, true);
else
- pgStatDBHash = pgstat_read_statsfile(MyDatabaseId, false);
+ pgStatDBHash = pgstat_read_statsfile(MyDatabaseId, false, false);
}
@@ -4084,44 +4468,84 @@ pgstat_clear_snapshot(void)
static void
pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
{
- /*
- * Advance last_statrequest if this requestor has a newer cutoff time
- * than any previous request.
- */
- if (msg->cutoff_time > last_statrequest)
- last_statrequest = msg->cutoff_time;
+ int i = 0;
+ bool found = false;
+ PgStat_StatDBEntry *dbentry;
+
+ elog(DEBUG1, "received inquiry for %d", msg->databaseid);
/*
- * If the requestor's local clock time is older than last_statwrite, we
- * should suspect a clock glitch, ie system time going backwards; though
- * the more likely explanation is just delayed message receipt. It is
- * worth expending a GetCurrentTimestamp call to be sure, since a large
- * retreat in the system clock reading could otherwise cause us to neglect
- * to update the stats file for a long time.
+ * Find the last write request for this DB (found=true in that case). Plain
+ * linear search, not really worth doing any magic here (probably).
*/
- if (msg->clock_time < last_statwrite)
+ for (i = 0; i < num_statrequests; i++)
+ {
+ if (last_statrequests[i].databaseid == msg->databaseid)
+ {
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ /*
+ * There already is a request for this DB, so let's advance the
+ * request time if this requestor has a newer cutoff time
+ * than any previous request.
+ */
+ if (msg->cutoff_time > last_statrequests[i].request_time)
+ last_statrequests[i].request_time = msg->cutoff_time;
+ }
+ else
{
- TimestampTz cur_ts = GetCurrentTimestamp();
+ /*
+ * There's no request for this DB yet, so let's create it (allocate a
+ * space for it, set the values).
+ */
+ if (last_statrequests == NULL)
+ last_statrequests = palloc(sizeof(DBWriteRequest));
+ else
+ last_statrequests = repalloc(last_statrequests,
+ (num_statrequests + 1)*sizeof(DBWriteRequest));
+
+ last_statrequests[num_statrequests].databaseid = msg->databaseid;
+ last_statrequests[num_statrequests].request_time = msg->clock_time;
+ num_statrequests += 1;
- if (cur_ts < last_statwrite)
+ /*
+ * If the requestor's local clock time is older than last_statwrite, we
+ * should suspect a clock glitch, ie system time going backwards; though
+ * the more likely explanation is just delayed message receipt. It is
+ * worth expending a GetCurrentTimestamp call to be sure, since a large
+ * retreat in the system clock reading could otherwise cause us to neglect
+ * to update the stats file for a long time.
+ */
+ dbentry = pgstat_get_db_entry(msg->databaseid, false);
+ if ((dbentry != NULL) && (msg->clock_time < dbentry->stats_timestamp))
{
- /*
- * Sure enough, time went backwards. Force a new stats file write
- * to get back in sync; but first, log a complaint.
- */
- char *writetime;
- char *mytime;
-
- /* Copy because timestamptz_to_str returns a static buffer */
- writetime = pstrdup(timestamptz_to_str(last_statwrite));
- mytime = pstrdup(timestamptz_to_str(cur_ts));
- elog(LOG, "last_statwrite %s is later than collector's time %s",
- writetime, mytime);
- pfree(writetime);
- pfree(mytime);
-
- last_statrequest = cur_ts;
- last_statwrite = last_statrequest - 1;
+ TimestampTz cur_ts = GetCurrentTimestamp();
+
+ if (cur_ts < dbentry->stats_timestamp)
+ {
+ /*
+ * Sure enough, time went backwards. Force a new stats file write
+ * to get back in sync; but first, log a complaint.
+ */
+ char *writetime;
+ char *mytime;
+
+ /* Copy because timestamptz_to_str returns a static buffer */
+ writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
+ mytime = pstrdup(timestamptz_to_str(cur_ts));
+ elog(LOG, "last_statwrite %s is later than collector's time %s for "
+ "db %d", writetime, mytime, dbentry->databaseid);
+ pfree(writetime);
+ pfree(mytime);
+
+ last_statrequests[num_statrequests].request_time = cur_ts;
+ dbentry->stats_timestamp = cur_ts - 1;
+ }
}
}
}
@@ -4278,10 +4702,17 @@ pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
/*
- * If found, remove it.
+ * If found, remove it (along with the db statfile).
*/
if (dbentry)
{
+ char db_statfile[strlen(pgstat_stat_db_filename) + 11];
+ snprintf(db_statfile, strlen(pgstat_stat_db_filename) + 11,
+ pgstat_stat_db_filename, dbentry->databaseid);
+
+ elog(DEBUG1, "removing %s", db_statfile);
+ unlink(db_statfile);
+
if (dbentry->tables != NULL)
hash_destroy(dbentry->tables);
if (dbentry->functions != NULL)
@@ -4687,3 +5118,58 @@ pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
HASH_REMOVE, NULL);
}
}
+
+/* ----------
+ * pgstat_write_statsfile_needed() -
+ *
+ * Checks whether there's a db stats request, requiring a file write.
+ *
+ * TODO Seems that thanks to the way we handle last_statrequests (erase after
+ * a write), this is unnecessary. Just check that there's at least one
+ * request and you're done. Although there might be delayed requests ...
+ * ----------
+ */
+
+static bool pgstat_write_statsfile_needed(void)
+{
+ int i = 0;
+ PgStat_StatDBEntry *dbentry;
+
+ /* Check the databases if they need to refresh the stats. */
+ for (i = 0; i < num_statrequests; i++)
+ {
+ dbentry = pgstat_get_db_entry(last_statrequests[i].databaseid, false);
+
+ /* No dbentry yet or too old. */
+ if ((! dbentry) ||
+ (dbentry->stats_timestamp < last_statrequests[i].request_time)) {
+ return true;
+ }
+
+ }
+
+ /* Well, everything was written recently ... */
+ return false;
+}
+
+/* ----------
+ * pgstat_db_requested() -
+ *
+ * Checks whether stats for a particular DB need to be written to a file.
+ * ----------
+ */
+
+static bool
+pgstat_db_requested(Oid databaseid)
+{
+ int i = 0;
+
+ /* Check if there is a write request for this particular database. */
+ for (i = 0; i < num_statrequests; i++)
+ {
+ if (last_statrequests[i].databaseid == databaseid)
+ return true;
+ }
+
+ return false;
+}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index b0af9f5..08ef324 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -8709,20 +8709,43 @@ static void
assign_pgstat_temp_directory(const char *newval, void *extra)
{
/* check_canonical_path already canonicalized newval for us */
+ char *dname;
char *tname;
char *fname;
-
- tname = guc_malloc(ERROR, strlen(newval) + 12); /* /pgstat.tmp */
- sprintf(tname, "%s/pgstat.tmp", newval);
- fname = guc_malloc(ERROR, strlen(newval) + 13); /* /pgstat.stat */
- sprintf(fname, "%s/pgstat.stat", newval);
-
+ char *tname_db;
+ char *fname_db;
+
+ /* directory */
+ dname = guc_malloc(ERROR, strlen(newval) + 1); /* runtime dir */
+ sprintf(dname, "%s", newval);
+
+ /* global stats */
+ tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
+ sprintf(tname, "%s/global.tmp", newval);
+ fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
+ sprintf(fname, "%s/global.stat", newval);
+
+ /* per-db stats */
+ tname_db = guc_malloc(ERROR, strlen(newval) + 8); /* /%d.tmp */
+ sprintf(tname_db, "%s/%%d.tmp", newval);
+ fname_db = guc_malloc(ERROR, strlen(newval) + 9); /* /%d.stat */
+ sprintf(fname_db, "%s/%%d.stat", newval);
+
+ if (pgstat_stat_directory)
+ free(pgstat_stat_directory);
+ pgstat_stat_directory = dname;
if (pgstat_stat_tmpname)
free(pgstat_stat_tmpname);
pgstat_stat_tmpname = tname;
if (pgstat_stat_filename)
free(pgstat_stat_filename);
pgstat_stat_filename = fname;
+ if (pgstat_stat_db_tmpname)
+ free(pgstat_stat_db_tmpname);
+ pgstat_stat_db_tmpname = tname_db;
+ if (pgstat_stat_db_filename)
+ free(pgstat_stat_db_filename);
+ pgstat_stat_db_filename = fname_db;
}
static bool
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 1bba426..da1e19f 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -192,6 +192,7 @@ const char *subdirs[] = {
"base",
"base/1",
"pg_tblspc",
+ "pg_stat",
"pg_stat_tmp"
};
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 03c0174..d7d4ad9 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -205,6 +205,7 @@ typedef struct PgStat_MsgInquiry
PgStat_MsgHdr m_hdr;
TimestampTz clock_time; /* observed local clock time */
TimestampTz cutoff_time; /* minimum acceptable file timestamp */
+ Oid databaseid; /* requested DB (InvalidOid => all DBs) */
} PgStat_MsgInquiry;
@@ -514,7 +515,7 @@ typedef union PgStat_Msg
* ------------------------------------------------------------
*/
-#define PGSTAT_FILE_FORMAT_ID 0x01A5BC9A
+#define PGSTAT_FILE_FORMAT_ID 0xA240CA47
/* ----------
* PgStat_StatDBEntry The collector's data per database
@@ -545,6 +546,7 @@ typedef struct PgStat_StatDBEntry
PgStat_Counter n_block_write_time;
TimestampTz stat_reset_timestamp;
+ TimestampTz stats_timestamp; /* time of db stats file update */
/*
* tables and functions must be last in the struct, because we don't write
@@ -722,8 +724,11 @@ extern bool pgstat_track_activities;
extern bool pgstat_track_counts;
extern int pgstat_track_functions;
extern PGDLLIMPORT int pgstat_track_activity_query_size;
+extern char *pgstat_stat_directory;
extern char *pgstat_stat_tmpname;
extern char *pgstat_stat_filename;
+extern char *pgstat_stat_db_tmpname;
+extern char *pgstat_stat_db_filename;
/*
* BgWriter statistics counters are updated directly by bgwriter and bufmgr
On 1.2.2013 17:19, Alvaro Herrera wrote:
Tomas Vondra wrote:
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index be3adf1..4ec485e 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -64,10 +64,14 @@
/* ----------
* Paths for the statistics files (relative to installation's $PGDATA).
+ * Permanent and temprorary, global and per-database files.

Note typo in the line above.
-#define PGSTAT_STAT_PERMANENT_FILENAME "global/pgstat.stat"
-#define PGSTAT_STAT_PERMANENT_TMPFILE "global/pgstat.tmp"
+#define PGSTAT_STAT_PERMANENT_DIRECTORY "stat"
+#define PGSTAT_STAT_PERMANENT_FILENAME "stat/global.stat"
+#define PGSTAT_STAT_PERMANENT_TMPFILE "stat/global.tmp"
+#define PGSTAT_STAT_PERMANENT_DB_FILENAME "stat/%d.stat"
+#define PGSTAT_STAT_PERMANENT_DB_TMPFILE "stat/%d.tmp"

+char *pgstat_stat_directory = NULL;
char *pgstat_stat_filename = NULL;
char *pgstat_stat_tmpname = NULL;
+char *pgstat_stat_db_filename = NULL;
+char *pgstat_stat_db_tmpname = NULL;

I don't like the quoted parts very much; it seems awkward to have the
snprintf patterns in one place and have them be used in very distant
places.

I don't see that as particularly awkward, but that's a matter of taste.
I still see that as a bunch of constants that are sprintf patterns at
the same time.

Is there a way to improve that? Also, if I understand correctly,
the pgstat_stat_db_filename value needs to be an snprintf pattern too,
right? What if it doesn't contain the required % specifier?

Ummmm, yes - it needs to be a pattern too, but the user specifies the
directory (stats_temp_directory) and this is used to derive all the
other values - see assign_pgstat_temp_directory() in guc.c.
Also, if you can filter this through pgindent, that would be best. Make
sure to add DBWriteRequest to src/tools/pgindent/typedefs_list.
Will do.
+ /*
+  * There's no request for this DB yet, so lets create it (allocate a
+  * space for it, set the values).
+  */
+ if (last_statrequests == NULL)
+     last_statrequests = palloc(sizeof(DBWriteRequest));
+ else
+     last_statrequests = repalloc(last_statrequests,
+                          (num_statrequests + 1)*sizeof(DBWriteRequest));
+
+ last_statrequests[num_statrequests].databaseid = msg->databaseid;
+ last_statrequests[num_statrequests].request_time = msg->clock_time;
+ num_statrequests += 1;

Having to repalloc this array each time seems wrong. Would a list
instead of an array help? See ilist.c/h; I vote for a dlist because you
can easily delete elements from the middle of it, if required (I think
you'd need that.)
Thanks. I'm not very familiar with the list interface, so I've used a
plain array. But yes, there are better ways than doing repalloc all the
time.
+ char db_statfile[strlen(pgstat_stat_db_filename) + 11];
+ snprintf(db_statfile, strlen(pgstat_stat_db_filename) + 11,
+          pgstat_stat_filename, dbentry->databaseid);

This pattern seems rather frequent. Can we use a macro or similar here?
Encapsulating the "11" better would be good. Magic numbers are evil.
Yes, this needs to be cleaned / improved.
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 613c1c2..b3467d2 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -205,6 +205,7 @@ typedef struct PgStat_MsgInquiry
 PgStat_MsgHdr m_hdr;
 TimestampTz clock_time; /* observed local clock time */
 TimestampTz cutoff_time; /* minimum acceptable file timestamp */
+ Oid databaseid; /* requested DB (InvalidOid => all DBs) */
 } PgStat_MsgInquiry;

Do we need to support the case that somebody requests stuff from the
"shared" DB? IIRC that's what InvalidOid means in pgstat ...
Frankly, I don't know, but I guess we do because it was in the original
code, and there are such inquiries right after the database starts
(that's why I had to add pgstat_write_db_dummyfile).
Tomas
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Sun, Feb 3, 2013 at 4:51 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
LOG: last_statwrite 11133-08-28 19:22:31.711744+02 is later than
collector's time 2013-02-04 00:54:21.113439+01 for db 19093
WARNING: pgstat wait timeout
LOG: last_statwrite 39681-12-23 18:48:48.9093+01 is later than
collector's time 2013-02-04 00:54:31.424681+01 for db 46494
FATAL: could not find block containing chunk 0x2af4a60
LOG: statistics collector process (PID 10063) exited with exit code 1
WARNING: pgstat wait timeout
WARNING: pgstat wait timeout

I'm not entirely sure where the FATAL came from, but it seems it was
somehow related to the issue - it was quite reproducible, although I
don't see how exactly could this happen. There relevant block of code
looks like this:

char *writetime;
char *mytime;

/* Copy because timestamptz_to_str returns a static buffer */
writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
mytime = pstrdup(timestamptz_to_str(cur_ts));
elog(LOG, "last_statwrite %s is later than collector's time %s for "
"db %d", writetime, mytime, dbentry->databaseid);
pfree(writetime);
pfree(mytime);which seems quite fine to mee. I'm not sure how one of the pfree calls
could fail?
I don't recall seeing the FATAL errors myself, but didn't keep the
logfile around. (I do recall seeing the pgstat wait timeout).
Are you using Windows? pstrdup seems to be different there.
I'm afraid I don't have much to say on the code. Indeed, I never even
looked at it (other than grepping for pstrdup just now). I am taking a
purely experimental approach, since Alvaro and others have looked at
the code.
Anyway, attached is a patch that fixes all three issues, i.e.
1) the un-initialized timestamp
2) the "void" ommited from the signature
3) rename to "pg_stat" instead of just "stat"
Thanks.
If I shut down the server and blow away the stats with "rm
data/pg_stat/*", it recovers gracefully when I start it back up. If I
do "rm -r data/pg_stat" then it has problems the next time I shut it
down, but I have no right to do that in the first place. If I initdb
a database without this patch, then shut it down and restart with
binaries that include this patch, I need to manually make the
pg_stat directory. Does that mean it needs a catalog bump in order to
force an initdb?
A review:
It applies cleanly (some offsets, no fuzz), builds without warnings,
and passes make check including with cassert.
The final test done in "make check" inherently tests this code, and it
passes. If I intentionally break the patch by making
pgstat_read_db_statsfile add one to the oid it opens, then the test
fails. So the existing test is at least plausible as a test.
doc/src/sgml/monitoring.sgml needs to be changed: "a permanent copy of
the statistics data is stored in the global subdirectory". I'm not
aware of any other needed changes to the docs.
The big question is whether we want this. I think we do. While
having hundreds of databases in a cluster is not recommended, that is
no reason not to handle it better than we do. I don't see any
down-sides, other than possibly some code uglification. Some file
systems might not deal well with having lots of small stats files
being rapidly written and rewritten, but it is hard to see how the
current behavior would be more favorable for those systems.
We do not already have this. There is no relevant spec. I can't see
how this could need pg_dump support (but what about pg_upgrade?)
I am not aware of any dangers.
I have a question about its completeness. When I first start up the
cluster and have not yet touched it, there is very little stats
collector activity, either with or without this patch. When I kick
the cluster sufficiently (I've been using vacuumdb -a to do that) then
there is a lot of stats collector activity. Even once the vacuumdb
has long finished, this high level of activity continues even though
the database is otherwise completely idle, and this seems to happen
forever. This patch makes that high level of activity much more
efficient, but it does not reduce the activity. I don't understand
why an idle database cannot get back into the state right after
start-up.
I do not think that the patch needs to solve this problem in order to
be accepted, but if it can be addressed while the author and reviewers
are paying attention to this part of the system, that would be ideal.
And if not, then we should at least remember that there is future work
that could be done here.
I created 1000 databases each with 200 single column tables (x serial
primary key).
After vacuumdb -a, I let it idle for a long time to see what steady
state was reached.
without the patch:
vacuumdb -a real 11m2.624s
idle steady state: 48.17% user, 39.24% system, 11.78% iowait, 0.81% idle.
with the patch:
vacuumdb -a real 6m41.306s
idle steady state: 7.86% user, 5.00% system, 0.09% iowait, 87% idle.
I also ran pgbench on a scale that fits in memory with fsync=off, on a
single CPU machine. With the same above-mentioned 1000 databases as
unused decoys to bloat the stats file.
pgbench_tellers and branches undergo enough turnover that they should
get vacuumed every minute (naptime).
Without the patch, they only get vacuumed every 40 minutes or so as
the autovac workers are so distracted by reading the bloated stats
file, and the TPS is ~680.
With the patch, they get vacuumed every 1 to 2 minutes and TPS is ~940.

So this seems to be effective at its intended goal.
I have not done a review of the code itself, only the performance.
Cheers,
Jeff
On 5.2.2013 19:23, Jeff Janes wrote:
On Sun, Feb 3, 2013 at 4:51 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
LOG: last_statwrite 11133-08-28 19:22:31.711744+02 is later than
collector's time 2013-02-04 00:54:21.113439+01 for db 19093
WARNING: pgstat wait timeout
LOG: last_statwrite 39681-12-23 18:48:48.9093+01 is later than
collector's time 2013-02-04 00:54:31.424681+01 for db 46494
FATAL: could not find block containing chunk 0x2af4a60
LOG: statistics collector process (PID 10063) exited with exit code 1
WARNING: pgstat wait timeout
WARNING: pgstat wait timeout

I'm not entirely sure where the FATAL came from, but it seems it was
somehow related to the issue - it was quite reproducible, although I
don't see exactly how this could happen. The relevant block of code
looks like this:

char *writetime;
char *mytime;

/* Copy because timestamptz_to_str returns a static buffer */
writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
mytime = pstrdup(timestamptz_to_str(cur_ts));
elog(LOG, "last_statwrite %s is later than collector's time %s for "
"db %d", writetime, mytime, dbentry->databaseid);
pfree(writetime);
pfree(mytime);

which seems quite fine to me. I'm not sure how one of the pfree calls
could fail?

I don't recall seeing the FATAL errors myself, but didn't keep the
logfile around. (I do recall seeing the pgstat wait timeout.)

Are you using Windows? pstrdup seems to be different there.
Nope. I'll repeat the test with the original patch to find out what went
wrong, just to be sure it was fixed.
I'm afraid I don't have much to say on the code. Indeed, I never even
looked at it (other than grepping for pstrdup just now). I am taking a
purely experimental approach, since Alvaro and others have looked at
the code.
Thanks for finding the issue with the uninitialized timestamp!
If I shut down the server and blow away the stats with "rm
data/pg_stat/*", it recovers gracefully when I start it back up. If I
do "rm -r data/pg_stat" then it has problems the next time I shut it
down, but I have no right to do that in the first place. If I initdb
a database without this patch, then shut it down and restart with
binaries that include this patch, I need to manually make the
pg_stat directory. Does that mean it needs a catalog bump in order to
force an initdb?
Ummmm, what do you mean by "catalog bump"?
Anyway, messing with files in the "base" directory is a bad idea in
general, and I don't think that's a reason to treat the pg_stat
directory differently. If you remove it by hand, you'll be rightfully
punished by various errors.
A review:
It applies cleanly (some offsets, no fuzz), builds without warnings,
and passes make check, including with cassert.

The final test done in "make check" inherently tests this code, and it
passes. If I intentionally break the patch by making
pgstat_read_db_statsfile add one to the oid it opens, then the test
fails. So the existing test is at least plausible as a test.

doc/src/sgml/monitoring.sgml needs to be changed: "a permanent copy of
the statistics data is stored in the global subdirectory". I'm not
aware of any other needed changes to the docs.
Yeah, that should be "in the global/pg_stat subdirectory".
The big question is whether we want this. I think we do. While
having hundreds of databases in a cluster is not recommended, that is
no reason not to handle it better than we do. I don't see any
down-sides, other than possibly some code uglification. Some file
systems might not deal well with having lots of small stats files
being rapidly written and rewritten, but it is hard to see how the
current behavior would be more favorable for those systems.
If the filesystem has issues with that many entries, it's already hosed
by the contents of the "base" directory (one per database) or the
database directories themselves (multiple files per table).
Moreover, it's still possible to use tmpfs to handle this at runtime
(which is often the recommended solution with the current code), and use
the actual filesystem only for keeping the data across restarts.
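For reference, the workaround usually recommended here is pointing the runtime stats at a tmpfs via the stats_temp_directory GUC, leaving the permanent copy on disk across restarts; a sketch with illustrative paths:

```
# /etc/fstab -- a small tmpfs for the runtime stats (path illustrative;
# chown it to the postgres user after mounting)
tmpfs  /var/lib/pgsql/pg_stat_tmp  tmpfs  size=256M,mode=0750  0 0

# postgresql.conf -- point the transient stats there; the permanent
# files are still written under $PGDATA at shutdown
stats_temp_directory = '/var/lib/pgsql/pg_stat_tmp'
```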
We do not already have this. There is no relevant spec. I can't see
how this could need pg_dump support (but what about pg_upgrade?)
pg_dump - no
pg_upgrade - IMHO it should create the pg_stat directory. I don't think
it could "convert" statfile into the new format (by splitting it into
the pieces). I haven't checked but I believe the default behavior is to
delete it as there might be new fields / slight changes of meaning etc.
I am not aware of any dangers.
I have a question about its completeness. When I first start up the
cluster and have not yet touched it, there is very little stats
collector activity, either with or without this patch. When I kick
the cluster sufficiently (I've been using vacuumdb -a to do that) then
there is a lot of stats collector activity. Even once the vacuumdb
has long finished, this high level of activity continues even though
the database is otherwise completely idle, and this seems to happen
forever. This patch makes that high level of activity much more
efficient, but it does not reduce the activity. I don't understand
why an idle database cannot get back into the state right after
start-up.
What do you mean by "stats collector activity"? Is it reading/writing a
lot of data, or is it just using a lot of CPU?
Isn't that just natural and expected behavior, because the database
needs to actually perform ANALYZE to collect the data? Although
the tables are empty, it costs some CPU / IO and there's a lot of them
(1000 dbs, each with 200 tables).
I don't think there's a way around this. You may increase the autovacuum
naptime, but that's about all.
I do not think that the patch needs to solve this problem in order to
be accepted, but if it can be addressed while the author and reviewers
are paying attention to this part of the system, that would be ideal.
And if not, then we should at least remember that there is future work
that could be done here.
If I understand that correctly, you see the same behaviour even without
the patch, right? In that case I'd vote not to make the patch more
complex, and try to improve that separately (if it's even possible).
I created 1000 databases each with 200 single column tables (x serial
primary key).

After vacuumdb -a, I let it idle for a long time to see what steady
state was reached.

without the patch:
vacuumdb -a real 11m2.624s
idle steady state: 48.17% user, 39.24% system, 11.78% iowait, 0.81% idle.

with the patch:
vacuumdb -a real 6m41.306s
idle steady state: 7.86% user, 5.00% system, 0.09% iowait, 87% idle.
Nice. Other interesting numbers would be device utilization, average
I/O speed and required space (which should be ~2x the pgstat.stat size
without the patch).
I also ran pgbench on a scale that fits in memory with fsync=off, on a
single CPU machine. With the same above-mentioned 1000 databases as
unused decoys to bloat the stats file.

pgbench_tellers and branches undergo enough turnover that they should
get vacuumed every minute (naptime).

Without the patch, they only get vacuumed every 40 minutes or so as
the autovac workers are so distracted by reading the bloated stats
file, and the TPS is ~680.

With the patch, they get vacuumed every 1 to 2 minutes and TPS is ~940.
Great, I haven't really aimed to improve pgbench results, but it seems
natural that the decreased CPU utilization can go somewhere else. Not bad.
Have you moved the stats somewhere to tmpfs, or have you used the
default location (on disk)?
Tomas
with the patch:
vacuumdb -a real 6m41.306s
idle steady state: 7.86% user, 5.00% system, 0.09% iowait, 87% idle.

Nice. Other interesting numbers would be device utilization, average
I/O speed and required space (which should be ~2x the pgstat.stat size
without the patch).
this point is important - with a large warehouse with lots of databases
and tables you have to move the stat file to a ramdisk - without that
you lose a lot of I/O capacity - and it is very important if you then
need only a half-sized ramdisk
Regards
Pavel
Pavel Stehule <pavel.stehule@gmail.com> writes:
Nice. Other interesting numbers would be device utilization, average
I/O speed and required space (which should be ~2x the pgstat.stat size
without the patch).

this point is important - with a large warehouse with lots of databases
and tables you have to move the stat file to a ramdisk - without that
you lose a lot of I/O capacity - and it is very important if you then
need only a half-sized ramdisk
[ blink... ] I confess I'd not been paying close attention to this
thread, but if that's true I'd say the patch is DOA. Why should we
accept 2x bloat in the already-far-too-large stats file? I thought
the idea was just to split up the existing data into multiple files.
regards, tom lane
Tom Lane escribió:
Pavel Stehule <pavel.stehule@gmail.com> writes:
Nice. Other interesting numbers would be device utilization, average
I/O speed and required space (which should be ~2x the pgstat.stat size
without the patch).

this point is important - with a large warehouse with lots of databases
and tables you have to move the stat file to a ramdisk - without that
you lose a lot of I/O capacity - and it is very important if you then
need only a half-sized ramdisk

[ blink... ] I confess I'd not been paying close attention to this
thread, but if that's true I'd say the patch is DOA. Why should we
accept 2x bloat in the already-far-too-large stats file? I thought
the idea was just to split up the existing data into multiple files.
I think they are saying just the opposite: maximum disk space
utilization is now half of the unpatched code. This is because when we
need to write the temporary file to rename on top of the other one, the
temporary file is not the size of the complete pgstat data collection,
but just that for the requested database.
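In other words, each per-database file is written under a temporary name and rename()d into place, so the transient overhead is one database's worth of stats rather than the whole collection. A stand-alone sketch of that pattern (write_db_statsfile and the file naming are invented for this demo, not the patch's actual code):

```c
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

/* Sketch of the write-temp-then-rename pattern: the extra disk space
 * needed at any moment is only the size of the one file being
 * rewritten. Paths and naming are illustrative. */
static int write_db_statsfile(const char *dir, unsigned int dboid,
                              const char *data, size_t len)
{
    char tmppath[256];
    char path[256];
    FILE *fp;

    snprintf(tmppath, sizeof(tmppath), "%s/db_%u.tmp", dir, dboid);
    snprintf(path, sizeof(path), "%s/db_%u.stat", dir, dboid);

    fp = fopen(tmppath, "wb");
    if (fp == NULL)
        return -1;
    if (fwrite(data, 1, len, fp) != len)
    {
        fclose(fp);
        remove(tmppath);
        return -1;
    }
    if (fclose(fp) != 0)
    {
        remove(tmppath);
        return -1;
    }
    /* Atomically replace the old file; readers never see a torn file. */
    if (rename(tmppath, path) != 0)
    {
        remove(tmppath);
        return -1;
    }
    return 0;
}
```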
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
2013/2/6 Alvaro Herrera <alvherre@2ndquadrant.com>:
Tom Lane escribió:
Pavel Stehule <pavel.stehule@gmail.com> writes:
Nice. Other interesting numbers would be device utilization, average
I/O speed and required space (which should be ~2x the pgstat.stat size
without the patch).

this point is important - with a large warehouse with lots of databases
and tables you have to move the stat file to a ramdisk - without that
you lose a lot of I/O capacity - and it is very important if you then
need only a half-sized ramdisk

[ blink... ] I confess I'd not been paying close attention to this
thread, but if that's true I'd say the patch is DOA. Why should we
accept 2x bloat in the already-far-too-large stats file? I thought
the idea was just to split up the existing data into multiple files.

I think they are saying just the opposite: maximum disk space
utilization is now half of the unpatched code. This is because when we
need to write the temporary file to rename on top of the other one, the
temporary file is not the size of the complete pgstat data collection,
but just that for the requested database.
+1
Pavel
Dne 06.02.2013 16:53, Alvaro Herrera napsal:
Tom Lane escribió:
Pavel Stehule <pavel.stehule@gmail.com> writes:
Nice. Other interesting numbers would be device utilization, average
I/O speed and required space (which should be ~2x the pgstat.stat size
without the patch).

this point is important - with a large warehouse with lots of databases
and tables you have to move the stat file to a ramdisk - without that
you lose a lot of I/O capacity - and it is very important if you then
need only a half-sized ramdisk

[ blink... ] I confess I'd not been paying close attention to this
thread, but if that's true I'd say the patch is DOA. Why should we
accept 2x bloat in the already-far-too-large stats file? I thought
the idea was just to split up the existing data into multiple files.

I think they are saying just the opposite: maximum disk space
utilization is now half of the unpatched code. This is because when we
need to write the temporary file to rename on top of the other one, the
temporary file is not the size of the complete pgstat data collection,
but just that for the requested database.
Exactly. And I suspect the current (unpatched) code often requires more
than twice the space because of open file descriptors to already deleted
files.
Tomas
On Tue, Feb 5, 2013 at 2:31 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
On 5.2.2013 19:23, Jeff Janes wrote:
If I shut down the server and blow away the stats with "rm
data/pg_stat/*", it recovers gracefully when I start it back up. If I
do "rm -r data/pg_stat" then it has problems the next time I shut it
down, but I have no right to do that in the first place. If I initdb
a database without this patch, then shut it down and restart with
binaries that include this patch, I need to manually make the
pg_stat directory. Does that mean it needs a catalog bump in order to
force an initdb?

Ummmm, what do you mean by "catalog bump"?
There is a catalog number in src/include/catalog/catversion.h, which
when changed forces one to redo initdb.
Formally I guess it is only for system catalog changes, but I thought
it was used for any on-disk changes during development cycles. I like
it the way it is, as I can use the same data directory for both
versions of the binary (patched and unpatched), and just manually
create or remove the pg_stat directory when changing modes.
That is ideal for testing this patch, probably not ideal for being
committed into the tree along with all the other ongoing devel work.
But I think this is something the committer has to worry about.
I have a question about its completeness. When I first start up the
cluster and have not yet touched it, there is very little stats
collector activity, either with or without this patch. When I kick
the cluster sufficiently (I've been using vacuumdb -a to do that) then
there is a lot of stats collector activity. Even once the vacuumdb
has long finished, this high level of activity continues even though
the database is otherwise completely idle, and this seems to happen
forever. This patch makes that high level of activity much more
efficient, but it does not reduce the activity. I don't understand
why an idle database cannot get back into the state right after
start-up.

What do you mean by "stats collector activity"? Is it reading/writing a
lot of data, or is it just using a lot of CPU?
Basically, the launching of new autovac workers and the work that that
entails. Your patch reduces the size of data that needs to be
written, read, and parsed for every launch, but not the number of
times that that happens.
Isn't that just natural and expected behavior, because the database
needs to actually perform ANALYZE to collect the data? Although
the tables are empty, it costs some CPU / IO and there's a lot of them
(1000 dbs, each with 200 tables).
It isn't touching the tables at all, just the stats files.
I was wrong about the cluster opening quietly. It only does that if,
while the cluster was shutdown, you remove the statistics files which
I was doing, as I was switching back and forth between patched and
unpatched.
When the cluster opens, any databases that don't have statistics in
the stat file(s) will not get an autovacuum worker process spawned.
They only start getting spawned once someone asks for statistics for
that database. But then once that happens, that database then gets a
worker spawned for it every naptime (or, at least, as close to that as
the server can keep up with) for eternity, even if that database is
never used again. The only way to stop this is the unsupported way of
blowing away the permanent stats files.
I don't think there's a way around this. You may increase the autovacuum
naptime, but that's about all.

I do not think that the patch needs to solve this problem in order to
be accepted, but if it can be addressed while the author and reviewers
are paying attention to this part of the system, that would be ideal.
And if not, then we should at least remember that there is future work
that could be done here.

If I understand that correctly, you see the same behaviour even without
the patch, right? In that case I'd vote not to make the patch more
complex, and try to improve that separately (if it's even possible).
OK. I just thought that while digging through the code, you might
have a good idea for fixing this part as well. If so, it would be a
shame for that idea to be lost when you move on to other things.
I created 1000 databases each with 200 single column tables (x serial
primary key).

After vacuumdb -a, I let it idle for a long time to see what steady
state was reached.

without the patch:
vacuumdb -a real 11m2.624s
idle steady state: 48.17% user, 39.24% system, 11.78% iowait, 0.81% idle.

with the patch:
vacuumdb -a real 6m41.306s
idle steady state: 7.86% user, 5.00% system, 0.09% iowait, 87% idle.

Nice. Other interesting numbers would be device utilization, average
I/O speed
I didn't gather that data, as I never figured out how to interpret
those numbers and so don't have much faith in them. (But I am pretty
impressed with the numbers I do understand)
and required space (which should be ~2x the pgstat.stat size
without the patch).
I didn't study this in depth, but the patch seems to do what it should
(that is, take less space, not more). If I fill the device up so
that there is less than 3x the size of the stats file available for
use (i.e. space for the file itself and for one temp copy of it
but not space for a complete second temp copy), I occasionally get
out-of-space warning with unpatched. But never get those errors with
patched. Indeed, with patch I never get warnings even with only 1.04
times the aggregate size of the stats files available for use. (That
is, size for all the files, plus just 1/25 that amount to spare.
Obviously this limit is specific to having 1000 databases of equal
size.)
I also ran pgbench on a scale that fits in memory with fsync=off, on a
single CPU machine. With the same above-mentioned 1000 databases as
unused decoys to bloat the stats file.

pgbench_tellers and branches undergo enough turnover that they should
get vacuumed every minute (naptime).

Without the patch, they only get vacuumed every 40 minutes or so as
the autovac workers are so distracted by reading the bloated stats
file, and the TPS is ~680.

With the patch, they get vacuumed every 1 to 2 minutes and TPS is ~940.
Great, I haven't really aimed to improve pgbench results, but it seems
natural that the decreased CPU utilization can go somewhere else. Not bad.
My goal there was to prove to myself that the correct tables were
getting vacuumed. The TPS measurements were just a by-product of
that, but since I had them I figured I'd post them.
Have you moved the stats somewhere to tmpfs, or have you used the
default location (on disk)?
All the specific work I reported was with them on disk, except the
part about running out of space, which was done on /dev/shm. But even
with the data theoretically going to disk, the kernel caches it well
enough that I wouldn't expect things to change very much.
Two more questions I've come up with:
If I invoke pg_stat_reset() from a database, the corresponding file
does not get removed from the pg_stat_tmp directory. And when the
database is shut down, a file for the reset database does get created
in pg_stat. Is this OK?
Cheers,
Jeff
Jeff Janes <jeff.janes@gmail.com> writes:
On Tue, Feb 5, 2013 at 2:31 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
Ummmm, what do you mean by "catalog bump"?
There is a catalog number in src/include/catalog/catversion.h, which
when changed forces one to redo initdb.
Formally I guess it is only for system catalog changes, but I thought
it was used for any on-disk changes during development cycles.
Yeah, it would be appropriate to bump the catversion if we're creating a
new PGDATA subdirectory.
I'm not excited about keeping code to take care of the lack of such a
subdirectory at runtime, as I gather there is in the current state of
the patch. Formally, if there were such code, we'd not need a
catversion bump --- the rule of thumb is to change catversion if the new
postgres executable would fail regression tests without a run of the new
initdb. But it's pretty dumb to keep such code indefinitely, when it
would have no more possible use after the next catversion bump (which is
seldom more than a week or two away during devel phase).
What do you mean by "stats collector activity"? Is it reading/writing a
lot of data, or is it just using a lot of CPU?
Basically, the launching of new autovac workers and the work that that
entails. Your patch reduces the size of data that needs to be
written, read, and parsed for every launch, but not the number of
times that that happens.
It doesn't seem very reasonable to ask this patch to redesign the
autovacuum algorithms, which is essentially what it'll take to improve
that. That's a completely separate layer of code.
regards, tom lane
On 7.2.2013 00:40, Tom Lane wrote:
Jeff Janes <jeff.janes@gmail.com> writes:
On Tue, Feb 5, 2013 at 2:31 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
Ummmm, what do you mean by "catalog bump"?

There is a catalog number in src/include/catalog/catversion.h, which
when changed forces one to redo initdb.

Formally I guess it is only for system catalog changes, but I thought
it was used for any on-disk changes during development cycles.

Yeah, it would be appropriate to bump the catversion if we're creating a
new PGDATA subdirectory.

I'm not excited about keeping code to take care of the lack of such a
subdirectory at runtime, as I gather there is in the current state of
the patch. Formally, if there were such code, we'd not need a
No, there is nothing to handle that at runtime. The directory is created
at initdb and the patch expects that (and fails if it's gone).
catversion bump --- the rule of thumb is to change catversion if the new
postgres executable would fail regression tests without a run of the new
initdb. But it's pretty dumb to keep such code indefinitely, when it
would have no more possible use after the next catversion bump (which is
seldom more than a week or two away during devel phase).

What do you mean by "stats collector activity"? Is it reading/writing a
lot of data, or is it just using a lot of CPU?

Basically, the launching of new autovac workers and the work that that
entails. Your patch reduces the size of data that needs to be
written, read, and parsed for every launch, but not the number of
times that that happens.

It doesn't seem very reasonable to ask this patch to redesign the
autovacuum algorithms, which is essentially what it'll take to improve
that. That's a completely separate layer of code.
My opinion, exactly.
Tomas
On Tue, Feb 5, 2013 at 2:31 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
We do not already have this. There is no relevant spec. I can't see
how this could need pg_dump support (but what about pg_upgrade?)

pg_dump - no

pg_upgrade - IMHO it should create the pg_stat directory. I don't think
it could "convert" statfile into the new format (by splitting it into
the pieces). I haven't checked but I believe the default behavior is to
delete it as there might be new fields / slight changes of meaning etc.
Right, I have no concerns with pg_upgrade any more. The pg_stat
directory will inherently get created by the initdb of the new cluster
(because the initdb will be done with the new binaries that include
your patch).
pg_upgrade currently doesn't copy over global/pgstat.stat. So that
means the new cluster doesn't have the activity stats either way,
patch or unpatched. So if it is not currently a problem it will not
become one under the proposed patch.
Cheers,
Jeff
Here's an updated version of this patch that takes care of the issues I
reported previously: no more repalloc() of the requests array; it's now
an slist, which makes the code much more natural IMV. And no more
messing around with doing sprintf to create a separate sprintf pattern
for the per-db stats file; instead have a function to return the name
that uses just the pgstat dir as stored by GUC. I think this can be
further simplified still.
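A rough stand-alone sketch of the kind of helper described (the name get_dbstat_filename and the db_<oid> file layout are guesses for illustration, not necessarily what the patch ended up with):

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch: build a per-database stats file name from the
 * configured stats directory, instead of keeping a separate sprintf
 * pattern around. "permanent" selects the on-disk name vs. the
 * transient one. */
static void get_dbstat_filename(const char *statdir, int permanent,
                                unsigned int databaseid,
                                char *dest, size_t destlen)
{
    snprintf(dest, destlen, "%s/db_%u.%s",
             statdir, databaseid, permanent ? "stat" : "tmp");
}
```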
I haven't reviewed the rest yet; please do give this a try to confirm
that the speedups previously reported are still there (i.e. I didn't
completely blow it).
Thanks
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment: stats-split-v7.patch (text/x-diff; charset=us-ascii)
*** a/src/backend/postmaster/pgstat.c
--- b/src/backend/postmaster/pgstat.c
***************
*** 38,43 ****
--- 38,44 ----
#include "access/xact.h"
#include "catalog/pg_database.h"
#include "catalog/pg_proc.h"
+ #include "lib/ilist.h"
#include "libpq/ip.h"
#include "libpq/libpq.h"
#include "libpq/pqsignal.h"
***************
*** 66,73 ****
* Paths for the statistics files (relative to installation's $PGDATA).
* ----------
*/
! #define PGSTAT_STAT_PERMANENT_FILENAME "global/pgstat.stat"
! #define PGSTAT_STAT_PERMANENT_TMPFILE "global/pgstat.tmp"
/* ----------
* Timer definitions.
--- 67,75 ----
* Paths for the statistics files (relative to installation's $PGDATA).
* ----------
*/
! #define PGSTAT_STAT_PERMANENT_DIRECTORY "pg_stat"
! #define PGSTAT_STAT_PERMANENT_FILENAME "pg_stat/global.stat"
! #define PGSTAT_STAT_PERMANENT_TMPFILE "pg_stat/global.tmp"
/* ----------
* Timer definitions.
***************
*** 115,120 ****
int pgstat_track_activity_query_size = 1024;
--- 117,123 ----
* Built from GUC parameter
* ----------
*/
+ char *pgstat_stat_directory = NULL;
char *pgstat_stat_filename = NULL;
char *pgstat_stat_tmpname = NULL;
***************
*** 219,229 ****
static int localNumBackends = 0;
*/
static PgStat_GlobalStats globalStats;
! /* Last time the collector successfully wrote the stats file */
! static TimestampTz last_statwrite;
! /* Latest statistics request time from backends */
! static TimestampTz last_statrequest;
static volatile bool need_exit = false;
static volatile bool got_SIGHUP = false;
--- 222,237 ----
*/
static PgStat_GlobalStats globalStats;
! /* Write request info for each database */
! typedef struct DBWriteRequest
! {
! Oid databaseid; /* OID of the database to write */
! TimestampTz request_time; /* timestamp of the last write request */
! slist_node next;
! } DBWriteRequest;
! /* Latest statistics request time from backends for each DB */
! static slist_head last_statrequests = SLIST_STATIC_INIT(last_statrequests);
static volatile bool need_exit = false;
static volatile bool got_SIGHUP = false;
***************
*** 252,262 **** static void pgstat_sighup_handler(SIGNAL_ARGS);
static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
Oid tableoid, bool create);
! static void pgstat_write_statsfile(bool permanent);
! static HTAB *pgstat_read_statsfile(Oid onlydb, bool permanent);
static void backend_read_statsfile(void);
static void pgstat_read_current_status(void);
static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
static void pgstat_send_funcstats(void);
static HTAB *pgstat_collect_oids(Oid catalogid);
--- 260,276 ----
static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
Oid tableoid, bool create);
! static void pgstat_write_statsfile(bool permanent, bool force);
! static void pgstat_write_db_statsfile(PgStat_StatDBEntry * dbentry, bool permanent);
! static void pgstat_write_db_dummyfile(Oid databaseid);
! static HTAB *pgstat_read_statsfile(Oid onlydb, bool permanent, bool onlydbs);
! static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
static void backend_read_statsfile(void);
static void pgstat_read_current_status(void);
+ static bool pgstat_write_statsfile_needed(void);
+ static bool pgstat_db_requested(Oid databaseid);
+
static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
static void pgstat_send_funcstats(void);
static HTAB *pgstat_collect_oids(Oid catalogid);
***************
*** 285,291 **** static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int le
static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
-
/* ------------------------------------------------------------
* Public functions called from postmaster follow
* ------------------------------------------------------------
--- 299,304 ----
***************
*** 549,556 **** startup_failed:
void
pgstat_reset_all(void)
{
! unlink(pgstat_stat_filename);
! unlink(PGSTAT_STAT_PERMANENT_FILENAME);
}
#ifdef EXEC_BACKEND
--- 562,605 ----
void
pgstat_reset_all(void)
{
! DIR * dir;
! struct dirent * entry;
!
! dir = AllocateDir(pgstat_stat_directory);
! while ((entry = ReadDir(dir, pgstat_stat_directory)) != NULL)
! {
! char *fname;
! int totlen;
!
! if (strcmp(entry->d_name, ".") == 0 || strcmp(entry->d_name, "..") == 0)
! continue;
!
! totlen = strlen(pgstat_stat_directory) + strlen(entry->d_name) + 2;
! fname = palloc(totlen);
!
! snprintf(fname, totlen, "%s/%s", pgstat_stat_directory, entry->d_name);
! unlink(fname);
! pfree(fname);
! }
! FreeDir(dir);
!
! dir = AllocateDir(PGSTAT_STAT_PERMANENT_DIRECTORY);
! while ((entry = ReadDir(dir, PGSTAT_STAT_PERMANENT_DIRECTORY)) != NULL)
! {
! char *fname;
! int totlen;
!
! if (strcmp(entry->d_name, ".") == 0 || strcmp(entry->d_name, "..") == 0)
! continue;
!
! totlen = strlen(PGSTAT_STAT_PERMANENT_DIRECTORY) + strlen(entry->d_name) + 2;
! fname = palloc(totlen);
!
! snprintf(fname, totlen, "%s/%s", PGSTAT_STAT_PERMANENT_DIRECTORY, entry->d_name);
! unlink(fname);
! pfree(fname);
! }
! FreeDir(dir);
}
#ifdef EXEC_BACKEND
***************
*** 1408,1420 **** pgstat_ping(void)
* ----------
*/
static void
! pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time)
{
PgStat_MsgInquiry msg;
pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
msg.clock_time = clock_time;
msg.cutoff_time = cutoff_time;
pgstat_send(&msg, sizeof(msg));
}
--- 1457,1470 ----
* ----------
*/
static void
! pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
{
PgStat_MsgInquiry msg;
pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
msg.clock_time = clock_time;
msg.cutoff_time = cutoff_time;
+ msg.databaseid = databaseid;
pgstat_send(&msg, sizeof(msg));
}
***************
*** 3004,3009 **** PgstatCollectorMain(int argc, char *argv[])
--- 3054,3060 ----
int len;
PgStat_Msg msg;
int wr;
+ bool first_write = true;
IsUnderPostmaster = true; /* we are a postmaster subprocess now */
***************
*** 3053,3069 **** PgstatCollectorMain(int argc, char *argv[])
init_ps_display("stats collector process", "", "", "");
/*
- * Arrange to write the initial status file right away
- */
- last_statrequest = GetCurrentTimestamp();
- last_statwrite = last_statrequest - 1;
-
- /*
* Read in an existing statistics stats file or initialize the stats to
! * zero.
*/
pgStatRunningInCollector = true;
! pgStatDBHash = pgstat_read_statsfile(InvalidOid, true);
/*
* Loop to process messages until we get SIGQUIT or detect ungraceful
--- 3104,3114 ----
init_ps_display("stats collector process", "", "", "");
/*
* Read in an existing statistics stats file or initialize the stats to
! * zero (read data for all databases, including table/func stats).
*/
pgStatRunningInCollector = true;
! pgStatDBHash = pgstat_read_statsfile(InvalidOid, true, false);
/*
* Loop to process messages until we get SIGQUIT or detect ungraceful
***************
*** 3107,3116 **** PgstatCollectorMain(int argc, char *argv[])
/*
* Write the stats file if a new request has arrived that is not
! * satisfied by existing file.
*/
! if (last_statwrite < last_statrequest)
! pgstat_write_statsfile(false);
/*
* Try to receive and process a message. This will not block,
--- 3152,3165 ----
/*
* Write the stats file if a new request has arrived that is not
! * satisfied by existing file (force writing all files if it's
! * the first write after startup).
*/
! if (first_write || pgstat_write_statsfile_needed())
! {
! pgstat_write_statsfile(false, first_write);
! first_write = false;
! }
/*
* Try to receive and process a message. This will not block,
***************
*** 3269,3275 **** PgstatCollectorMain(int argc, char *argv[])
/*
* Save the final stats to reuse at next startup.
*/
! pgstat_write_statsfile(true);
exit(0);
}
--- 3318,3324 ----
/*
* Save the final stats to reuse at next startup.
*/
! pgstat_write_statsfile(true, true);
exit(0);
}
***************
*** 3349,3354 **** pgstat_get_db_entry(Oid databaseid, bool create)
--- 3398,3404 ----
result->n_block_write_time = 0;
result->stat_reset_timestamp = GetCurrentTimestamp();
+ result->stats_timestamp = 0;
memset(&hash_ctl, 0, sizeof(hash_ctl));
hash_ctl.keysize = sizeof(Oid);
***************
*** 3429,3451 **** pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
* shutting down only), remove the temporary file so that backends
* starting up under a new postmaster can't read the old data before
* the new collector is ready.
* ----------
*/
static void
! pgstat_write_statsfile(bool permanent)
{
HASH_SEQ_STATUS hstat;
- HASH_SEQ_STATUS tstat;
- HASH_SEQ_STATUS fstat;
PgStat_StatDBEntry *dbentry;
- PgStat_StatTabEntry *tabentry;
- PgStat_StatFuncEntry *funcentry;
FILE *fpout;
int32 format_id;
const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
int rc;
/*
* Open the statistics temp file to write out the current values.
*/
--- 3479,3503 ----
* shutting down only), remove the temporary file so that backends
* starting up under a new postmaster can't read the old data before
* the new collector is ready.
+ *
+ * When 'allDbs' is false, only the requested databases (listed in
+ * last_statrequests) will be written. If 'allDbs' is true, all databases
+ * will be written.
* ----------
*/
static void
! pgstat_write_statsfile(bool permanent, bool allDbs)
{
HASH_SEQ_STATUS hstat;
PgStat_StatDBEntry *dbentry;
FILE *fpout;
int32 format_id;
const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
int rc;
+ elog(DEBUG1, "writing statsfile '%s'", statfile);
+
/*
* Open the statistics temp file to write out the current values.
*/
***************
*** 3484,3489 **** pgstat_write_statsfile(bool permanent)
--- 3536,3555 ----
while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
{
/*
+ * Write out the tables and functions into a separate file, but only
+ * if the database is in the requests or if all DBs are to be written.
+ *
+ * We need to do this before the dbentry write to write the proper
+ * timestamp to the global file.
+ */
+ if (allDbs || pgstat_db_requested(dbentry->databaseid))
+ {
+ elog(DEBUG1, "writing statsfile for DB %d", dbentry->databaseid);
+ dbentry->stats_timestamp = globalStats.stats_timestamp;
+ pgstat_write_db_statsfile(dbentry, permanent);
+ }
+
+ /*
* Write out the DB entry including the number of live backends. We
* don't write the tables or functions pointers, since they're of no
* use to any other process.
***************
*** 3493,3521 **** pgstat_write_statsfile(bool permanent)
(void) rc; /* we'll check for error with ferror */
/*
- * Walk through the database's access stats per table.
- */
- hash_seq_init(&tstat, dbentry->tables);
- while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
- {
- fputc('T', fpout);
- rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
- (void) rc; /* we'll check for error with ferror */
- }
-
- /*
- * Walk through the database's function stats table.
- */
- hash_seq_init(&fstat, dbentry->functions);
- while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
- {
- fputc('F', fpout);
- rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
- (void) rc; /* we'll check for error with ferror */
- }
-
- /*
* Mark the end of this DB
*/
fputc('d', fpout);
}
--- 3559,3568 ----
(void) rc; /* we'll check for error with ferror */
/*
* Mark the end of this DB
+ *
+ * TODO Does using these chars still make sense, when the tables/func
+ * stats are moved to a separate file?
*/
fputc('d', fpout);
}
***************
*** 3527,3532 **** pgstat_write_statsfile(bool permanent)
--- 3574,3607 ----
*/
fputc('E', fpout);
+ /* In any case, we can just throw away all the db requests, but we need to
+ * write dummy files for databases without a stat entry (it would cause
+ * issues in pgstat_read_db_statsfile_timestamp and pgstat wait timeouts).
+ * This may happen e.g. for shared DB (oid = 0) right after initdb.
+ */
+ if (!slist_is_empty(&last_statrequests))
+ {
+ slist_mutable_iter iter;
+
+ slist_foreach_modify(iter, &last_statrequests)
+ {
+ DBWriteRequest *req = slist_container(DBWriteRequest, next,
+ iter.cur);
+
+ /*
+ * Create dummy files for requested databases without a proper
+ * dbentry. It's much easier this way than dealing with multiple
+ * timestamps, possibly existing but not yet written DBs etc.
+ * */
+ if (!pgstat_get_db_entry(req->databaseid, false))
+ pgstat_write_db_dummyfile(req->databaseid);
+
+ pfree(req);
+ }
+
+ slist_init(&last_statrequests);
+ }
+
if (ferror(fpout))
{
ereport(LOG,
***************
*** 3552,3608 **** pgstat_write_statsfile(bool permanent)
tmpfile, statfile)));
unlink(tmpfile);
}
- else
- {
- /*
- * Successful write, so update last_statwrite.
- */
- last_statwrite = globalStats.stats_timestamp;
-
- /*
- * If there is clock skew between backends and the collector, we could
- * receive a stats request time that's in the future. If so, complain
- * and reset last_statrequest. Resetting ensures that no inquiry
- * message can cause more than one stats file write to occur.
- */
- if (last_statrequest > last_statwrite)
- {
- char *reqtime;
- char *mytime;
-
- /* Copy because timestamptz_to_str returns a static buffer */
- reqtime = pstrdup(timestamptz_to_str(last_statrequest));
- mytime = pstrdup(timestamptz_to_str(last_statwrite));
- elog(LOG, "last_statrequest %s is later than collector's time %s",
- reqtime, mytime);
- pfree(reqtime);
- pfree(mytime);
-
- last_statrequest = last_statwrite;
- }
- }
if (permanent)
unlink(pgstat_stat_filename);
}
/* ----------
* pgstat_read_statsfile() -
*
* Reads in an existing statistics collector file and initializes the
* databases' hash table (whose entries point to the tables' hash tables).
* ----------
*/
static HTAB *
! pgstat_read_statsfile(Oid onlydb, bool permanent)
{
PgStat_StatDBEntry *dbentry;
PgStat_StatDBEntry dbbuf;
- PgStat_StatTabEntry *tabentry;
- PgStat_StatTabEntry tabbuf;
- PgStat_StatFuncEntry funcbuf;
- PgStat_StatFuncEntry *funcentry;
HASHCTL hash_ctl;
HTAB *dbhash;
HTAB *tabhash = NULL;
--- 3627,3905 ----
tmpfile, statfile)));
unlink(tmpfile);
}
if (permanent)
unlink(pgstat_stat_filename);
}
+ /*
+ * return the length that a DB stat file would have (including terminating \0)
+ *
+ * XXX We could avoid this overhead by caching a maximum length in
+ * assign_pgstat_temp_directory; also the distinctions on "permanent" and
+ * "tempname" seem pointless (what do you mean to save one byte of stack
+ * space!?)
+ */
+ static int
+ get_dbstat_file_len(bool permanent, bool tempname, Oid databaseid)
+ {
+ char tmp[1];
+ int len;
+
+ /* don't actually print, but return how many chars would be used */
+ len = snprintf(tmp, 1, "%s/db_%u.%s",
+ permanent ? "pg_stat" : pgstat_stat_directory,
+ databaseid,
+ tempname ? "tmp" : "stat");
+ /* XXX pointless? */
+ if (len >= MAXPGPATH)
+ elog(PANIC, "pgstat path too long");
+
+ /* count terminating \0 */
+ return len + 1;
+ }
+
+ /*
+ * return the filename for a DB stat file; filename is the output buffer,
+ * and len is its length.
+ */
+ static void
+ get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
+ char *filename, int len)
+ {
+ #ifdef USE_ASSERT_CHECKING
+ int printed;
+
+ printed =
+ #endif
+ snprintf(filename, len, "%s/db_%u.%s",
+ permanent ? "pg_stat" : pgstat_stat_directory,
+ databaseid,
+ tempname ? "tmp" : "stat");
+ Assert(printed <= len);
+ }
+
+ /* ----------
+ * pgstat_write_db_statsfile() -
+ *
+ * Tell the news. This writes the stats file for a single database.
+ *
+ * If writing to the permanent file (happens when the collector is
+ * shutting down only), remove the temporary file so that backends
+ * starting up under a new postmaster can't read the old data before
+ * the new collector is ready.
+ * ----------
+ */
+ static void
+ pgstat_write_db_statsfile(PgStat_StatDBEntry * dbentry, bool permanent)
+ {
+ HASH_SEQ_STATUS tstat;
+ HASH_SEQ_STATUS fstat;
+ PgStat_StatTabEntry *tabentry;
+ PgStat_StatFuncEntry *funcentry;
+ FILE *fpout;
+ int32 format_id;
+ Oid dbid = dbentry->databaseid;
+ int rc;
+ int tmpfilelen = get_dbstat_file_len(permanent, true, dbid);
+ char tmpfile[tmpfilelen];
+ int statfilelen = get_dbstat_file_len(permanent, false, dbid);
+ char statfile[statfilelen];
+
+ get_dbstat_filename(permanent, true, dbid, tmpfile, tmpfilelen);
+ get_dbstat_filename(permanent, false, dbid, statfile, statfilelen);
+
+ elog(DEBUG1, "writing statsfile '%s'", statfile);
+
+ /*
+ * Open the statistics temp file to write out the current values.
+ */
+ fpout = AllocateFile(tmpfile, PG_BINARY_W);
+ if (fpout == NULL)
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not open temporary statistics file \"%s\": %m",
+ tmpfile)));
+ return;
+ }
+
+ /*
+ * Write the file header --- currently just a format ID.
+ */
+ format_id = PGSTAT_FILE_FORMAT_ID;
+ rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
+ (void) rc; /* we'll check for error with ferror */
+
+ /*
+ * Write the timestamp.
+ */
+ rc = fwrite(&(globalStats.stats_timestamp), sizeof(globalStats.stats_timestamp), 1, fpout);
+ (void) rc; /* we'll check for error with ferror */
+
+ /*
+ * Walk through the database's access stats per table.
+ */
+ hash_seq_init(&tstat, dbentry->tables);
+ while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
+ {
+ fputc('T', fpout);
+ rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
+ (void) rc; /* we'll check for error with ferror */
+ }
+
+ /*
+ * Walk through the database's function stats table.
+ */
+ hash_seq_init(&fstat, dbentry->functions);
+ while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
+ {
+ fputc('F', fpout);
+ rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
+ (void) rc; /* we'll check for error with ferror */
+ }
+
+ /*
+ * No more output to be done. Close the temp file and replace the old
+ * pgstat.stat with it. The ferror() check replaces testing for error
+ * after each individual fputc or fwrite above.
+ */
+ fputc('E', fpout);
+
+ if (ferror(fpout))
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not write temporary statistics file \"%s\": %m",
+ tmpfile)));
+ FreeFile(fpout);
+ unlink(tmpfile);
+ }
+ else if (FreeFile(fpout) < 0)
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not close temporary statistics file \"%s\": %m",
+ tmpfile)));
+ unlink(tmpfile);
+ }
+ else if (rename(tmpfile, statfile) < 0)
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
+ tmpfile, statfile)));
+ unlink(tmpfile);
+ }
+
+ if (permanent)
+ {
+ elog(DEBUG1, "removing temporary stat file '%s'", tmpfile);
+ unlink(tmpfile);
+ }
+ }
+
+
+ /* ----------
+ * pgstat_write_db_dummyfile() -
+ *
+ * All this does is write a dummy stat file for databases without a dbentry
+ * yet. It basically writes just a file header - format ID and a timestamp.
+ * ----------
+ */
+ static void
+ pgstat_write_db_dummyfile(Oid databaseid)
+ {
+ FILE *fpout;
+ int32 format_id;
+ int rc;
+ int tmpfilelen = get_dbstat_file_len(false, true, databaseid);
+ char tmpfile[tmpfilelen];
+ int statfilelen = get_dbstat_file_len(false, false, databaseid);
+ char statfile[statfilelen];
+
+ get_dbstat_filename(false, true, databaseid, tmpfile, tmpfilelen);
+ get_dbstat_filename(false, false, databaseid, statfile, statfilelen);
+
+ elog(DEBUG1, "writing statsfile '%s'", statfile);
+
+ /*
+ * Open the statistics temp file to write out the current values.
+ */
+ fpout = AllocateFile(tmpfile, PG_BINARY_W);
+ if (fpout == NULL)
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not open temporary statistics file \"%s\": %m",
+ tmpfile)));
+ return;
+ }
+
+ /*
+ * Write the file header --- currently just a format ID.
+ */
+ format_id = PGSTAT_FILE_FORMAT_ID;
+ rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
+ (void) rc; /* we'll check for error with ferror */
+
+ /*
+ * Write the timestamp.
+ */
+ rc = fwrite(&(globalStats.stats_timestamp), sizeof(globalStats.stats_timestamp), 1, fpout);
+ (void) rc; /* we'll check for error with ferror */
+
+ /*
+ * No more output to be done. Close the temp file and replace the old
+ * pgstat.stat with it. The ferror() check replaces testing for error
+ * after each individual fputc or fwrite above.
+ */
+ fputc('E', fpout);
+
+ if (ferror(fpout))
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not write temporary dummy statistics file \"%s\": %m",
+ tmpfile)));
+ FreeFile(fpout);
+ unlink(tmpfile);
+ }
+ else if (FreeFile(fpout) < 0)
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not close temporary dummy statistics file \"%s\": %m",
+ tmpfile)));
+ unlink(tmpfile);
+ }
+ else if (rename(tmpfile, statfile) < 0)
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not rename temporary dummy statistics file \"%s\" to \"%s\": %m",
+ tmpfile, statfile)));
+ unlink(tmpfile);
+ }
+
+ }
/* ----------
* pgstat_read_statsfile() -
*
* Reads in an existing statistics collector file and initializes the
* databases' hash table (whose entries point to the tables' hash tables).
+ *
+ * Allows reading only the global stats (at database level), which is just
+ * enough for many purposes (e.g. autovacuum launcher etc.). If this is
+ * sufficient for you, use onlydbs=true.
* ----------
*/
static HTAB *
! pgstat_read_statsfile(Oid onlydb, bool permanent, bool onlydbs)
{
PgStat_StatDBEntry *dbentry;
PgStat_StatDBEntry dbbuf;
HASHCTL hash_ctl;
HTAB *dbhash;
HTAB *tabhash = NULL;
***************
*** 3613,3618 **** pgstat_read_statsfile(Oid onlydb, bool permanent)
--- 3910,3920 ----
const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
/*
+ * If we want db-level stats only, we must not request a particular db.
+ */
+ Assert(!((onlydb != InvalidOid) && onlydbs));
+
+ /*
* The tables will live in pgStatLocalContext.
*/
pgstat_setup_memcxt();
***************
*** 3758,3763 **** pgstat_read_statsfile(Oid onlydb, bool permanent)
--- 4060,4075 ----
*/
tabhash = dbentry->tables;
funchash = dbentry->functions;
+
+ /*
+ * Read the data from the file for this database. If there was
+ * onlydb specified (!= InvalidOid), we would not get here because
+ * of a break above. So we don't need to recheck.
+ */
+ if (!onlydbs)
+ pgstat_read_db_statsfile(dbentry->databaseid, tabhash, funchash,
+ permanent);
+
break;
/*
***************
*** 3768,3773 **** pgstat_read_statsfile(Oid onlydb, bool permanent)
--- 4080,4177 ----
funchash = NULL;
break;
+ case 'E':
+ goto done;
+
+ default:
+ ereport(pgStatRunningInCollector ? LOG : WARNING,
+ (errmsg("corrupted statistics file \"%s\"",
+ statfile)));
+ goto done;
+ }
+ }
+
+ done:
+ FreeFile(fpin);
+
+ if (permanent)
+ unlink(PGSTAT_STAT_PERMANENT_FILENAME);
+
+ return dbhash;
+ }
+
+
+ /* ----------
+ * pgstat_read_db_statsfile() -
+ *
+ * Reads in an existing statistics collector db file and initializes the
+ * tables and functions hash tables (for the database identified by Oid).
+ * ----------
+ */
+ static void
+ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent)
+ {
+ PgStat_StatTabEntry *tabentry;
+ PgStat_StatTabEntry tabbuf;
+ PgStat_StatFuncEntry funcbuf;
+ PgStat_StatFuncEntry *funcentry;
+ FILE *fpin;
+ int32 format_id;
+ TimestampTz timestamp;
+ bool found;
+ int statfilelen = get_dbstat_file_len(permanent, false, databaseid);
+ char statfile[statfilelen];
+
+ get_dbstat_filename(permanent, false, databaseid, statfile, statfilelen);
+
+ /*
+ * Try to open the status file. If it doesn't exist, the backends simply
+ * return zero for anything and the collector simply starts from scratch
+ * with empty counters.
+ *
+ * ENOENT is a possibility if the stats collector is not running or has
+ * not yet written the stats file the first time. Any other failure
+ * condition is suspicious.
+ */
+ if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
+ {
+ if (errno != ENOENT)
+ ereport(pgStatRunningInCollector ? LOG : WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not open statistics file \"%s\": %m",
+ statfile)));
+ return;
+ }
+
+ /*
+ * Verify it's of the expected format.
+ */
+ if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id)
+ || format_id != PGSTAT_FILE_FORMAT_ID)
+ {
+ ereport(pgStatRunningInCollector ? LOG : WARNING,
+ (errmsg("corrupted statistics file \"%s\"", statfile)));
+ goto done;
+ }
+
+ /*
+ * Read global stats struct
+ */
+ if (fread(&timestamp, 1, sizeof(timestamp), fpin) != sizeof(timestamp))
+ {
+ ereport(pgStatRunningInCollector ? LOG : WARNING,
+ (errmsg("corrupted statistics file \"%s\"", statfile)));
+ goto done;
+ }
+
+ /*
+ * We found an existing collector stats file. Read it and put all the
+ * hashtable entries into place.
+ */
+ for (;;)
+ {
+ switch (fgetc(fpin))
+ {
/*
* 'T' A PgStat_StatTabEntry follows.
*/
***************
*** 3854,3878 **** done:
FreeFile(fpin);
if (permanent)
! unlink(PGSTAT_STAT_PERMANENT_FILENAME);
! return dbhash;
}
/* ----------
! * pgstat_read_statsfile_timestamp() -
*
! * Attempt to fetch the timestamp of an existing stats file.
* Returns TRUE if successful (timestamp is stored at *ts).
* ----------
*/
static bool
! pgstat_read_statsfile_timestamp(bool permanent, TimestampTz *ts)
{
! PgStat_GlobalStats myGlobalStats;
FILE *fpin;
int32 format_id;
! const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
/*
* Try to open the status file. As above, anything but ENOENT is worthy
--- 4258,4294 ----
FreeFile(fpin);
if (permanent)
! {
! int statfilelen = get_dbstat_file_len(permanent, false, databaseid);
! char statfile[statfilelen];
! get_dbstat_filename(permanent, false, databaseid, statfile, statfilelen);
!
! elog(DEBUG1, "removing permanent stats file '%s'", statfile);
! unlink(statfile);
! }
!
! return;
}
+
/* ----------
! * pgstat_read_db_statsfile_timestamp() -
*
! * Attempt to fetch the timestamp of an existing stats file (for a DB).
* Returns TRUE if successful (timestamp is stored at *ts).
* ----------
*/
static bool
! pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent, TimestampTz *ts)
{
! TimestampTz timestamp;
FILE *fpin;
int32 format_id;
! int filenamelen = get_dbstat_file_len(permanent, false, databaseid);
! char statfile[filenamelen];
!
! get_dbstat_filename(permanent, false, databaseid, statfile, filenamelen);
/*
* Try to open the status file. As above, anything but ENOENT is worthy
***************
*** 3903,3909 **** pgstat_read_statsfile_timestamp(bool permanent, TimestampTz *ts)
/*
* Read global stats struct
*/
! if (fread(&myGlobalStats, 1, sizeof(myGlobalStats), fpin) != sizeof(myGlobalStats))
{
ereport(pgStatRunningInCollector ? LOG : WARNING,
(errmsg("corrupted statistics file \"%s\"", statfile)));
--- 4319,4325 ----
/*
* Read global stats struct
*/
! if (fread(&timestamp, 1, sizeof(TimestampTz), fpin) != sizeof(TimestampTz))
{
ereport(pgStatRunningInCollector ? LOG : WARNING,
(errmsg("corrupted statistics file \"%s\"", statfile)));
***************
*** 3911,3917 **** pgstat_read_statsfile_timestamp(bool permanent, TimestampTz *ts)
return false;
}
! *ts = myGlobalStats.stats_timestamp;
FreeFile(fpin);
return true;
--- 4327,4333 ----
return false;
}
! *ts = timestamp;
FreeFile(fpin);
return true;
***************
*** 3947,3953 **** backend_read_statsfile(void)
CHECK_FOR_INTERRUPTS();
! ok = pgstat_read_statsfile_timestamp(false, &file_ts);
cur_ts = GetCurrentTimestamp();
/* Calculate min acceptable timestamp, if we didn't already */
--- 4363,4369 ----
CHECK_FOR_INTERRUPTS();
! ok = pgstat_read_db_statsfile_timestamp(MyDatabaseId, false, &file_ts);
cur_ts = GetCurrentTimestamp();
/* Calculate min acceptable timestamp, if we didn't already */
***************
*** 4006,4012 **** backend_read_statsfile(void)
pfree(mytime);
}
! pgstat_send_inquiry(cur_ts, min_ts);
break;
}
--- 4422,4428 ----
pfree(mytime);
}
! pgstat_send_inquiry(cur_ts, min_ts, MyDatabaseId);
break;
}
***************
*** 4016,4022 **** backend_read_statsfile(void)
/* Not there or too old, so kick the collector and wait a bit */
if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
! pgstat_send_inquiry(cur_ts, min_ts);
pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
}
--- 4432,4438 ----
/* Not there or too old, so kick the collector and wait a bit */
if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
! pgstat_send_inquiry(cur_ts, min_ts, MyDatabaseId);
pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
}
***************
*** 4026,4034 **** backend_read_statsfile(void)
/* Autovacuum launcher wants stats about all databases */
if (IsAutoVacuumLauncherProcess())
! pgStatDBHash = pgstat_read_statsfile(InvalidOid, false);
else
! pgStatDBHash = pgstat_read_statsfile(MyDatabaseId, false);
}
--- 4442,4457 ----
/* Autovacuum launcher wants stats about all databases */
if (IsAutoVacuumLauncherProcess())
! /*
! * FIXME Does it really need info including tables/functions? Or is it enough to read
! * database-level stats? It seems to me the launcher needs PgStat_StatDBEntry only
! * (at least that's how I understand the rebuild_database_list() in autovacuum.c),
! * because pgstat_stattabentries are used in do_autovacuum() only, and that's what's
! * executed in workers ... So maybe we'd be just fine by reading in the dbentries?
! */
! pgStatDBHash = pgstat_read_statsfile(InvalidOid, false, true);
else
! pgStatDBHash = pgstat_read_statsfile(MyDatabaseId, false, false);
}
***************
*** 4084,4109 **** pgstat_clear_snapshot(void)
static void
pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
{
/*
! * Advance last_statrequest if this requestor has a newer cutoff time
! * than any previous request.
*/
! if (msg->cutoff_time > last_statrequest)
! last_statrequest = msg->cutoff_time;
/*
! * If the requestor's local clock time is older than last_statwrite, we
* should suspect a clock glitch, ie system time going backwards; though
* the more likely explanation is just delayed message receipt. It is
* worth expending a GetCurrentTimestamp call to be sure, since a large
* retreat in the system clock reading could otherwise cause us to neglect
* to update the stats file for a long time.
*/
! if (msg->clock_time < last_statwrite)
{
TimestampTz cur_ts = GetCurrentTimestamp();
! if (cur_ts < last_statwrite)
{
/*
* Sure enough, time went backwards. Force a new stats file write
--- 4507,4559 ----
static void
pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
{
+ slist_iter iter;
+ bool found = false;
+ DBWriteRequest *newreq;
+ PgStat_StatDBEntry *dbentry;
+
+ elog(DEBUG1, "received inquiry for %d", msg->databaseid);
+
+ /*
+ * Find the last write request for this DB (found=true in that case). Plain
+ * linear search, not really worth doing any magic here (probably).
+ */
+ slist_foreach(iter, &last_statrequests)
+ {
+ DBWriteRequest *req = slist_container(DBWriteRequest, next, iter.cur);
+
+ if (req->databaseid != msg->databaseid)
+ continue;
+
+ if (msg->cutoff_time > req->request_time)
+ req->request_time = msg->cutoff_time;
+ found = true;
+ return;
+ }
+
/*
! * There's no request for this DB yet, so create one.
*/
! newreq = palloc(sizeof(DBWriteRequest));
!
! newreq->databaseid = msg->databaseid;
! newreq->request_time = msg->clock_time;
! slist_push_head(&last_statrequests, &newreq->next);
/*
! * If the requestor's local clock time is older than stats_timestamp, we
* should suspect a clock glitch, ie system time going backwards; though
* the more likely explanation is just delayed message receipt. It is
* worth expending a GetCurrentTimestamp call to be sure, since a large
* retreat in the system clock reading could otherwise cause us to neglect
* to update the stats file for a long time.
*/
! dbentry = pgstat_get_db_entry(msg->databaseid, false);
! if ((dbentry != NULL) && (msg->clock_time < dbentry->stats_timestamp))
{
TimestampTz cur_ts = GetCurrentTimestamp();
! if (cur_ts < dbentry->stats_timestamp)
{
/*
* Sure enough, time went backwards. Force a new stats file write
***************
*** 4113,4127 **** pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
char *mytime;
/* Copy because timestamptz_to_str returns a static buffer */
! writetime = pstrdup(timestamptz_to_str(last_statwrite));
mytime = pstrdup(timestamptz_to_str(cur_ts));
! elog(LOG, "last_statwrite %s is later than collector's time %s",
! writetime, mytime);
pfree(writetime);
pfree(mytime);
! last_statrequest = cur_ts;
! last_statwrite = last_statrequest - 1;
}
}
}
--- 4563,4578 ----
char *mytime;
/* Copy because timestamptz_to_str returns a static buffer */
! writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
mytime = pstrdup(timestamptz_to_str(cur_ts));
! elog(LOG,
! "stats_timestamp %s is later than collector's time %s for db %d",
! writetime, mytime, dbentry->databaseid);
pfree(writetime);
pfree(mytime);
! newreq->request_time = cur_ts;
! dbentry->stats_timestamp = cur_ts - 1;
}
}
}
***************
*** 4270,4298 **** pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
static void
pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
{
PgStat_StatDBEntry *dbentry;
/*
* Lookup the database in the hashtable.
*/
! dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
/*
! * If found, remove it.
*/
if (dbentry)
{
if (dbentry->tables != NULL)
hash_destroy(dbentry->tables);
if (dbentry->functions != NULL)
hash_destroy(dbentry->functions);
if (hash_search(pgStatDBHash,
! (void *) &(dbentry->databaseid),
HASH_REMOVE, NULL) == NULL)
ereport(ERROR,
! (errmsg("database hash table corrupted "
! "during cleanup --- abort")));
}
}
--- 4721,4757 ----
static void
pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
{
+ Oid dbid = msg->m_databaseid;
PgStat_StatDBEntry *dbentry;
/*
* Lookup the database in the hashtable.
*/
! dbentry = pgstat_get_db_entry(dbid, false);
/*
! * If found, remove it (along with the db statfile).
*/
if (dbentry)
{
+ int statfilelen = get_dbstat_file_len(true, false, dbid);
+ char statfile[statfilelen];
+
+ get_dbstat_filename(true, false, dbid, statfile, statfilelen);
+
+ elog(DEBUG1, "removing %s", statfile);
+ unlink(statfile);
+
if (dbentry->tables != NULL)
hash_destroy(dbentry->tables);
if (dbentry->functions != NULL)
hash_destroy(dbentry->functions);
if (hash_search(pgStatDBHash,
! (void *) &dbid,
HASH_REMOVE, NULL) == NULL)
ereport(ERROR,
! (errmsg("database hash table corrupted during cleanup --- abort")));
}
}
***************
*** 4687,4689 **** pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
--- 5146,5206 ----
HASH_REMOVE, NULL);
}
}
+
+ /* ----------
+ * pgstat_write_statsfile_needed() -
+ *
+ * Checks whether there's a db stats request, requiring a file write.
+ *
+ * TODO Seems that thanks to the way we handle last_statrequests (erased
+ * after a write), this is unnecessary. Just check that there's at least
+ * one request and you're done. Although there might be delayed requests ...
+ * ----------
+ */
+ static bool
+ pgstat_write_statsfile_needed(void)
+ {
+ PgStat_StatDBEntry *dbentry;
+ slist_iter iter;
+
+ /* Check the databases if they need to refresh the stats. */
+ slist_foreach(iter, &last_statrequests)
+ {
+ DBWriteRequest *req = slist_container(DBWriteRequest, next, iter.cur);
+
+ dbentry = pgstat_get_db_entry(req->databaseid, false);
+
+ /* No dbentry yet or too old. */
+ if (!dbentry || (dbentry->stats_timestamp < req->request_time))
+ {
+ return true;
+ }
+ }
+
+ /* Well, everything was written recently ... */
+ return false;
+ }
+
+ /* ----------
+ * pgstat_db_requested() -
+ *
+ * Checks whether stats for a particular DB need to be written to a file.
+ * ----------
+ */
+
+ static bool
+ pgstat_db_requested(Oid databaseid)
+ {
+ slist_iter iter;
+
+ /* Check the databases if they need to refresh the stats. */
+ slist_foreach(iter, &last_statrequests)
+ {
+ DBWriteRequest *req = slist_container(DBWriteRequest, next, iter.cur);
+
+ if (req->databaseid == databaseid)
+ return true;
+ }
+
+ return false;
+ }
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 8704,8717 **** static void
assign_pgstat_temp_directory(const char *newval, void *extra)
{
/* check_canonical_path already canonicalized newval for us */
char *tname;
char *fname;
! tname = guc_malloc(ERROR, strlen(newval) + 12); /* /pgstat.tmp */
! sprintf(tname, "%s/pgstat.tmp", newval);
! fname = guc_malloc(ERROR, strlen(newval) + 13); /* /pgstat.stat */
! sprintf(fname, "%s/pgstat.stat", newval);
if (pgstat_stat_tmpname)
free(pgstat_stat_tmpname);
pgstat_stat_tmpname = tname;
--- 8704,8726 ----
assign_pgstat_temp_directory(const char *newval, void *extra)
{
/* check_canonical_path already canonicalized newval for us */
+ char *dname;
char *tname;
char *fname;
! /* directory */
! dname = guc_malloc(ERROR, strlen(newval) + 1); /* runtime dir */
! sprintf(dname, "%s", newval);
+ /* global stats */
+ tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
+ sprintf(tname, "%s/global.tmp", newval);
+ fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
+ sprintf(fname, "%s/global.stat", newval);
+
+ if (pgstat_stat_directory)
+ free(pgstat_stat_directory);
+ pgstat_stat_directory = dname;
if (pgstat_stat_tmpname)
free(pgstat_stat_tmpname);
pgstat_stat_tmpname = tname;
*** a/src/bin/initdb/initdb.c
--- b/src/bin/initdb/initdb.c
***************
*** 192,197 **** const char *subdirs[] = {
--- 192,198 ----
"base",
"base/1",
"pg_tblspc",
+ "pg_stat",
"pg_stat_tmp"
};
*** a/src/include/pgstat.h
--- b/src/include/pgstat.h
***************
*** 205,210 **** typedef struct PgStat_MsgInquiry
--- 205,211 ----
PgStat_MsgHdr m_hdr;
TimestampTz clock_time; /* observed local clock time */
TimestampTz cutoff_time; /* minimum acceptable file timestamp */
+ Oid databaseid; /* requested DB (InvalidOid => all DBs) */
} PgStat_MsgInquiry;
***************
*** 514,520 **** typedef union PgStat_Msg
* ------------------------------------------------------------
*/
! #define PGSTAT_FILE_FORMAT_ID 0x01A5BC9A
/* ----------
* PgStat_StatDBEntry The collector's data per database
--- 515,521 ----
* ------------------------------------------------------------
*/
! #define PGSTAT_FILE_FORMAT_ID 0xA240CA47
/* ----------
* PgStat_StatDBEntry The collector's data per database
***************
*** 545,550 **** typedef struct PgStat_StatDBEntry
--- 546,552 ----
PgStat_Counter n_block_write_time;
TimestampTz stat_reset_timestamp;
+ TimestampTz stats_timestamp; /* time of db stats file update */
/*
* tables and functions must be last in the struct, because we don't write
***************
*** 722,727 **** extern bool pgstat_track_activities;
--- 724,730 ----
extern bool pgstat_track_counts;
extern int pgstat_track_functions;
extern PGDLLIMPORT int pgstat_track_activity_query_size;
+ extern char *pgstat_stat_directory;
extern char *pgstat_stat_tmpname;
extern char *pgstat_stat_filename;
Here's a ninth version of this patch. (version 8 went unpublished). I
have simplified a lot of things and improved some comments; I think I
understand much of it now. I think this patch is fairly close to
committable, but one issue remains, which is this bit in
pgstat_write_statsfiles():
/* In any case, we can just throw away all the db requests, but we need to
* write dummy files for databases without a stat entry (it would cause
* issues in pgstat_read_db_statsfile_timestamp and pgstat wait timeouts).
* This may happen e.g. for shared DB (oid = 0) right after initdb.
*/
if (!slist_is_empty(&last_statrequests))
{
slist_mutable_iter iter;
slist_foreach_modify(iter, &last_statrequests)
{
DBWriteRequest *req = slist_container(DBWriteRequest, next,
iter.cur);
/*
* Create dummy files for requested databases without a proper
* dbentry. It's much easier this way than dealing with multiple
* timestamps, possibly existing but not yet written DBs etc.
* */
if (!pgstat_get_db_entry(req->databaseid, false))
pgstat_write_db_dummyfile(req->databaseid);
pfree(req);
}
slist_init(&last_statrequests);
}
The problem here is that creating these dummy entries will cause a
difference in autovacuum behavior. Autovacuum skips databases with no
pgstat entry, the intended reason being that a database without an entry
hasn't had enough activity to need vacuuming. Now perhaps we want to
change that, but it should be an explicit decision taken after discussion
and thought, not a side effect of an unrelated patch.
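The skip behavior at issue can be illustrated with a minimal self-contained
sketch; all names and values here are illustrative stand-ins, not
PostgreSQL's actual API:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the behavior described above: autovacuum skips databases that
 * have no stats entry, so creating dummy entries would change which
 * databases it visits. Names are hypothetical, not PostgreSQL's. */
typedef unsigned int Oid;

typedef struct StatDBEntry
{
	Oid			databaseid;
} StatDBEntry;

/* databases that have seen tracked activity */
static StatDBEntry known[] = {{1}, {16384}};

static StatDBEntry *
fetch_stat_dbentry(Oid dbid)
{
	for (size_t i = 0; i < sizeof(known) / sizeof(known[0]); i++)
		if (known[i].databaseid == dbid)
			return &known[i];
	return NULL;				/* no entry => no tracked activity */
}

static int
autovacuum_should_visit(Oid dbid, int force)
{
	/* skip databases without a stats entry, unless forced */
	return force || fetch_stat_dbentry(dbid) != NULL;
}
```

A database that never appears in the stats hash is simply never visited,
which is why a dummy entry created as a side effect matters.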
Hm, and I now also realize another bug in this patch: the global stats
file only includes database entries for the requested databases; but the
existing per-database files may serve later requestors just fine for
databases that already had files, so the global stats file should
continue to carry entries for them, with the old timestamps.
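For reference, the per-database file header the patch writes (format ID,
then a timestamp, then an 'E' terminator) round-trips like this; the format
ID value mirrors the patch, while the int64 timestamp type and the file
name are simplifying assumptions:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Sketch of the per-db stats file header from the patch: format ID,
 * write timestamp, 'E' terminator. */
#define FORMAT_ID 0xA240CA47

static int
write_dummyfile(const char *path, int64_t ts)
{
	FILE	   *fp = fopen(path, "wb");
	uint32_t	id = FORMAT_ID;

	if (!fp)
		return -1;
	fwrite(&id, sizeof(id), 1, fp);
	fwrite(&ts, sizeof(ts), 1, fp);
	fputc('E', fp);
	return fclose(fp);
}

/* mirrors pgstat_read_db_statsfile_timestamp: validate header, return ts */
static int
read_timestamp(const char *path, int64_t *ts)
{
	FILE	   *fp = fopen(path, "rb");
	uint32_t	id;

	if (!fp)
		return -1;
	if (fread(&id, sizeof(id), 1, fp) != 1 || id != FORMAT_ID ||
		fread(ts, sizeof(*ts), 1, fp) != 1)
	{
		fclose(fp);
		return -1;				/* treat as corrupted */
	}
	fclose(fp);
	return 0;
}
```

This is exactly the header a dummy file carries, which is why a backend
waiting on pgstat_read_db_statsfile_timestamp is satisfied by one.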
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
stats-split-v9.patch (text/x-diff; charset=us-ascii)
*** a/src/backend/postmaster/pgstat.c
--- b/src/backend/postmaster/pgstat.c
***************
*** 38,43 ****
--- 38,44 ----
#include "access/xact.h"
#include "catalog/pg_database.h"
#include "catalog/pg_proc.h"
+ #include "lib/ilist.h"
#include "libpq/ip.h"
#include "libpq/libpq.h"
#include "libpq/pqsignal.h"
***************
*** 66,73 ****
* Paths for the statistics files (relative to installation's $PGDATA).
* ----------
*/
! #define PGSTAT_STAT_PERMANENT_FILENAME "global/pgstat.stat"
! #define PGSTAT_STAT_PERMANENT_TMPFILE "global/pgstat.tmp"
/* ----------
* Timer definitions.
--- 67,75 ----
* Paths for the statistics files (relative to installation's $PGDATA).
* ----------
*/
! #define PGSTAT_STAT_PERMANENT_DIRECTORY "pg_stat"
! #define PGSTAT_STAT_PERMANENT_FILENAME "pg_stat/global.stat"
! #define PGSTAT_STAT_PERMANENT_TMPFILE "pg_stat/global.tmp"
/* ----------
* Timer definitions.
***************
*** 115,120 **** int pgstat_track_activity_query_size = 1024;
--- 117,124 ----
* Built from GUC parameter
* ----------
*/
+ char *pgstat_stat_directory = NULL;
+ int pgstat_stat_dbfile_maxlen = 0;
char *pgstat_stat_filename = NULL;
char *pgstat_stat_tmpname = NULL;
***************
*** 219,229 **** static int localNumBackends = 0;
*/
static PgStat_GlobalStats globalStats;
! /* Last time the collector successfully wrote the stats file */
! static TimestampTz last_statwrite;
! /* Latest statistics request time from backends */
! static TimestampTz last_statrequest;
static volatile bool need_exit = false;
static volatile bool got_SIGHUP = false;
--- 223,238 ----
*/
static PgStat_GlobalStats globalStats;
! /* Write request info for each database */
! typedef struct DBWriteRequest
! {
! Oid databaseid; /* OID of the database to write */
! TimestampTz request_time; /* timestamp of the last write request */
! slist_node next;
! } DBWriteRequest;
! /* Latest statistics request times from backends */
! static slist_head last_statrequests = SLIST_STATIC_INIT(last_statrequests);
static volatile bool need_exit = false;
static volatile bool got_SIGHUP = false;
***************
*** 252,262 **** static void pgstat_sighup_handler(SIGNAL_ARGS);
static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
Oid tableoid, bool create);
! static void pgstat_write_statsfile(bool permanent);
! static HTAB *pgstat_read_statsfile(Oid onlydb, bool permanent);
static void backend_read_statsfile(void);
static void pgstat_read_current_status(void);
static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
static void pgstat_send_funcstats(void);
static HTAB *pgstat_collect_oids(Oid catalogid);
--- 261,277 ----
static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
Oid tableoid, bool create);
! static void pgstat_write_statsfiles(bool permanent, bool allDbs);
! static void pgstat_write_db_statsfile(PgStat_StatDBEntry * dbentry, bool permanent);
! static void pgstat_write_db_dummyfile(Oid databaseid);
! static HTAB *pgstat_read_statsfile(Oid onlydb, bool permanent, bool deep);
! static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
static void backend_read_statsfile(void);
static void pgstat_read_current_status(void);
+ static bool pgstat_write_statsfile_needed(void);
+ static bool pgstat_db_requested(Oid databaseid);
+
static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
static void pgstat_send_funcstats(void);
static HTAB *pgstat_collect_oids(Oid catalogid);
***************
*** 285,291 **** static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int le
static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
-
/* ------------------------------------------------------------
* Public functions called from postmaster follow
* ------------------------------------------------------------
--- 300,305 ----
***************
*** 541,556 **** startup_failed:
}
/*
* pgstat_reset_all() -
*
! * Remove the stats file. This is currently used only if WAL
* recovery is needed after a crash.
*/
void
pgstat_reset_all(void)
{
! unlink(pgstat_stat_filename);
! unlink(PGSTAT_STAT_PERMANENT_FILENAME);
}
#ifdef EXEC_BACKEND
--- 555,594 ----
}
/*
+ * subroutine for pgstat_reset_all
+ */
+ static void
+ pgstat_reset_remove_files(const char *directory)
+ {
+ DIR * dir;
+ struct dirent * entry;
+ char fname[MAXPGPATH];
+
+ dir = AllocateDir(directory);
+ while ((entry = ReadDir(dir, directory)) != NULL)
+ {
+ if (strcmp(entry->d_name, ".") == 0 || strcmp(entry->d_name, "..") == 0)
+ continue;
+
+ snprintf(fname, MAXPGPATH, "%s/%s", directory,
+ entry->d_name);
+ unlink(fname);
+ }
+ FreeDir(dir);
+ }
+
+ /*
* pgstat_reset_all() -
*
! * Remove the stats files. This is currently used only if WAL
* recovery is needed after a crash.
*/
void
pgstat_reset_all(void)
{
!
! pgstat_reset_remove_files(pgstat_stat_directory);
! pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
}
#ifdef EXEC_BACKEND
***************
*** 1408,1420 **** pgstat_ping(void)
* ----------
*/
static void
! pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time)
{
PgStat_MsgInquiry msg;
pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
msg.clock_time = clock_time;
msg.cutoff_time = cutoff_time;
pgstat_send(&msg, sizeof(msg));
}
--- 1446,1459 ----
* ----------
*/
static void
! pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
{
PgStat_MsgInquiry msg;
pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
msg.clock_time = clock_time;
msg.cutoff_time = cutoff_time;
+ msg.databaseid = databaseid;
pgstat_send(&msg, sizeof(msg));
}
***************
*** 3004,3009 **** PgstatCollectorMain(int argc, char *argv[])
--- 3043,3049 ----
int len;
PgStat_Msg msg;
int wr;
+ bool first_write = true;
IsUnderPostmaster = true; /* we are a postmaster subprocess now */
***************
*** 3053,3069 **** PgstatCollectorMain(int argc, char *argv[])
init_ps_display("stats collector process", "", "", "");
/*
- * Arrange to write the initial status file right away
- */
- last_statrequest = GetCurrentTimestamp();
- last_statwrite = last_statrequest - 1;
-
- /*
* Read in an existing statistics stats file or initialize the stats to
* zero.
*/
pgStatRunningInCollector = true;
! pgStatDBHash = pgstat_read_statsfile(InvalidOid, true);
/*
* Loop to process messages until we get SIGQUIT or detect ungraceful
--- 3093,3103 ----
init_ps_display("stats collector process", "", "", "");
/*
* Read in an existing statistics stats file or initialize the stats to
* zero.
*/
pgStatRunningInCollector = true;
! pgStatDBHash = pgstat_read_statsfile(InvalidOid, true, true);
/*
* Loop to process messages until we get SIGQUIT or detect ungraceful
***************
*** 3107,3116 **** PgstatCollectorMain(int argc, char *argv[])
/*
* Write the stats file if a new request has arrived that is not
! * satisfied by existing file.
*/
! if (last_statwrite < last_statrequest)
! pgstat_write_statsfile(false);
/*
* Try to receive and process a message. This will not block,
--- 3141,3154 ----
/*
* Write the stats file if a new request has arrived that is not
! * satisfied by existing file (force writing all files if it's
! * the first write after startup).
*/
! if (first_write || pgstat_write_statsfile_needed())
! {
! pgstat_write_statsfiles(false, first_write);
! first_write = false;
! }
/*
* Try to receive and process a message. This will not block,
***************
*** 3269,3275 **** PgstatCollectorMain(int argc, char *argv[])
/*
* Save the final stats to reuse at next startup.
*/
! pgstat_write_statsfile(true);
exit(0);
}
--- 3307,3313 ----
/*
* Save the final stats to reuse at next startup.
*/
! pgstat_write_statsfiles(true, true);
exit(0);
}
***************
*** 3349,3354 **** pgstat_get_db_entry(Oid databaseid, bool create)
--- 3387,3393 ----
result->n_block_write_time = 0;
result->stat_reset_timestamp = GetCurrentTimestamp();
+ result->stats_timestamp = 0;
memset(&hash_ctl, 0, sizeof(hash_ctl));
hash_ctl.keysize = sizeof(Oid);
***************
*** 3422,3451 **** pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
/* ----------
! * pgstat_write_statsfile() -
*
* Tell the news.
! * If writing to the permanent file (happens when the collector is
! * shutting down only), remove the temporary file so that backends
* starting up under a new postmaster can't read the old data before
* the new collector is ready.
* ----------
*/
static void
! pgstat_write_statsfile(bool permanent)
{
HASH_SEQ_STATUS hstat;
- HASH_SEQ_STATUS tstat;
- HASH_SEQ_STATUS fstat;
PgStat_StatDBEntry *dbentry;
- PgStat_StatTabEntry *tabentry;
- PgStat_StatFuncEntry *funcentry;
FILE *fpout;
int32 format_id;
const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
int rc;
/*
* Open the statistics temp file to write out the current values.
*/
--- 3461,3492 ----
/* ----------
! * pgstat_write_statsfiles() -
*
* Tell the news.
! * If writing to the permanent files (happens when the collector is
! * shutting down only), remove the temporary files so that backends
* starting up under a new postmaster can't read the old data before
* the new collector is ready.
+ *
+ * When 'allDbs' is false, only the requested databases (listed in
+ * last_statrequests) will be written. If 'allDbs' is true, all databases
+ * will be written.
* ----------
*/
static void
! pgstat_write_statsfiles(bool permanent, bool allDbs)
{
HASH_SEQ_STATUS hstat;
PgStat_StatDBEntry *dbentry;
FILE *fpout;
int32 format_id;
const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
int rc;
+ elog(DEBUG1, "writing statsfile '%s'", statfile);
+
/*
* Open the statistics temp file to write out the current values.
*/
***************
*** 3484,3523 **** pgstat_write_statsfile(bool permanent)
while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
{
/*
! * Write out the DB entry including the number of live backends. We
! * don't write the tables or functions pointers, since they're of no
! * use to any other process.
*/
fputc('D', fpout);
rc = fwrite(dbentry, offsetof(PgStat_StatDBEntry, tables), 1, fpout);
(void) rc; /* we'll check for error with ferror */
-
- /*
- * Walk through the database's access stats per table.
- */
- hash_seq_init(&tstat, dbentry->tables);
- while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
- {
- fputc('T', fpout);
- rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
- (void) rc; /* we'll check for error with ferror */
- }
-
- /*
- * Walk through the database's function stats table.
- */
- hash_seq_init(&fstat, dbentry->functions);
- while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
- {
- fputc('F', fpout);
- rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
- (void) rc; /* we'll check for error with ferror */
- }
-
- /*
- * Mark the end of this DB
- */
- fputc('d', fpout);
}
/*
--- 3525,3550 ----
while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
{
/*
! * Write out the tables and functions into a separate file, if
! * required.
! *
! * We need to do this before the dbentry write, to ensure the
! * timestamps written to both are consistent.
! */
! if (allDbs || pgstat_db_requested(dbentry->databaseid))
! {
! elog(DEBUG1, "writing statsfile for DB %u", dbentry->databaseid);
! dbentry->stats_timestamp = globalStats.stats_timestamp;
! pgstat_write_db_statsfile(dbentry, permanent);
! }
!
! /*
! * Write out the DB entry. We don't write the tables or functions
! * pointers, since they're of no use to any other process.
*/
fputc('D', fpout);
rc = fwrite(dbentry, offsetof(PgStat_StatDBEntry, tables), 1, fpout);
(void) rc; /* we'll check for error with ferror */
}
/*
***************
*** 3527,3532 **** pgstat_write_statsfile(bool permanent)
--- 3554,3587 ----
*/
fputc('E', fpout);
+ /* In any case, we can just throw away all the db requests, but we need to
+ * write dummy files for databases without a stat entry (it would cause
+ * issues in pgstat_read_db_statsfile_timestamp and pgstat wait timeouts).
+ * This may happen e.g. for shared DB (oid = 0) right after initdb.
+ */
+ if (!slist_is_empty(&last_statrequests))
+ {
+ slist_mutable_iter iter;
+
+ slist_foreach_modify(iter, &last_statrequests)
+ {
+ DBWriteRequest *req = slist_container(DBWriteRequest, next,
+ iter.cur);
+
+ /*
+ * Create dummy files for requested databases without a proper
+ * dbentry. It's much easier this way than dealing with multiple
+ * timestamps, possibly existing but not yet written DBs etc.
+ * */
+ if (!pgstat_get_db_entry(req->databaseid, false))
+ pgstat_write_db_dummyfile(req->databaseid);
+
+ pfree(req);
+ }
+
+ slist_init(&last_statrequests);
+ }
+
if (ferror(fpout))
{
ereport(LOG,
***************
*** 3552,3612 **** pgstat_write_statsfile(bool permanent)
tmpfile, statfile)));
unlink(tmpfile);
}
- else
- {
- /*
- * Successful write, so update last_statwrite.
- */
- last_statwrite = globalStats.stats_timestamp;
-
- /*
- * If there is clock skew between backends and the collector, we could
- * receive a stats request time that's in the future. If so, complain
- * and reset last_statrequest. Resetting ensures that no inquiry
- * message can cause more than one stats file write to occur.
- */
- if (last_statrequest > last_statwrite)
- {
- char *reqtime;
- char *mytime;
-
- /* Copy because timestamptz_to_str returns a static buffer */
- reqtime = pstrdup(timestamptz_to_str(last_statrequest));
- mytime = pstrdup(timestamptz_to_str(last_statwrite));
- elog(LOG, "last_statrequest %s is later than collector's time %s",
- reqtime, mytime);
- pfree(reqtime);
- pfree(mytime);
-
- last_statrequest = last_statwrite;
- }
- }
if (permanent)
unlink(pgstat_stat_filename);
}
/* ----------
* pgstat_read_statsfile() -
*
* Reads in an existing statistics collector file and initializes the
! * databases' hash table (whose entries point to the tables' hash tables).
* ----------
*/
static HTAB *
! pgstat_read_statsfile(Oid onlydb, bool permanent)
{
PgStat_StatDBEntry *dbentry;
PgStat_StatDBEntry dbbuf;
- PgStat_StatTabEntry *tabentry;
- PgStat_StatTabEntry tabbuf;
- PgStat_StatFuncEntry funcbuf;
- PgStat_StatFuncEntry *funcentry;
HASHCTL hash_ctl;
HTAB *dbhash;
- HTAB *tabhash = NULL;
- HTAB *funchash = NULL;
FILE *fpin;
int32 format_id;
bool found;
--- 3607,3855 ----
tmpfile, statfile)));
unlink(tmpfile);
}
if (permanent)
unlink(pgstat_stat_filename);
}
+ /*
+ * return the filename for a DB stat file; filename is the output buffer,
+ * of length len.
+ */
+ static void
+ get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
+ char *filename, int len)
+ {
+ int printed;
+
+ printed = snprintf(filename, len, "%s/db_%u.%s",
+ permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY : pgstat_stat_directory,
+ databaseid,
+ tempname ? "tmp" : "stat");
+ if (printed >= len)
+ elog(ERROR, "overlength pgstat path");
+ }
+
+ /* ----------
+ * pgstat_write_db_statsfile() -
+ *
+ * Tell the news. This writes stats file for a single database.
+ *
+ * If writing to the permanent file (happens when the collector is
+ * shutting down only), remove the temporary file so that backends
+ * starting up under a new postmaster can't read the old data before
+ * the new collector is ready.
+ * ----------
+ */
+ static void
+ pgstat_write_db_statsfile(PgStat_StatDBEntry * dbentry, bool permanent)
+ {
+ HASH_SEQ_STATUS tstat;
+ HASH_SEQ_STATUS fstat;
+ PgStat_StatTabEntry *tabentry;
+ PgStat_StatFuncEntry *funcentry;
+ FILE *fpout;
+ int32 format_id;
+ Oid dbid = dbentry->databaseid;
+ int rc;
+ char tmpfile[MAXPGPATH];
+ char statfile[MAXPGPATH];
+
+ get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
+ get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
+
+ elog(DEBUG1, "writing statsfile '%s'", statfile);
+
+ /*
+ * Open the statistics temp file to write out the current values.
+ */
+ fpout = AllocateFile(tmpfile, PG_BINARY_W);
+ if (fpout == NULL)
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not open temporary statistics file \"%s\": %m",
+ tmpfile)));
+ return;
+ }
+
+ /*
+ * Write the file header --- currently just a format ID.
+ */
+ format_id = PGSTAT_FILE_FORMAT_ID;
+ rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
+ (void) rc; /* we'll check for error with ferror */
+
+ /*
+ * Write the timestamp.
+ */
+ rc = fwrite(&(globalStats.stats_timestamp), sizeof(globalStats.stats_timestamp), 1, fpout);
+ (void) rc; /* we'll check for error with ferror */
+
+ /*
+ * Walk through the database's access stats per table.
+ */
+ hash_seq_init(&tstat, dbentry->tables);
+ while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
+ {
+ fputc('T', fpout);
+ rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
+ (void) rc; /* we'll check for error with ferror */
+ }
+
+ /*
+ * Walk through the database's function stats table.
+ */
+ hash_seq_init(&fstat, dbentry->functions);
+ while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
+ {
+ fputc('F', fpout);
+ rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
+ (void) rc; /* we'll check for error with ferror */
+ }
+
+ /*
+ * No more output to be done. Close the temp file and replace the old
+ * pgstat.stat with it. The ferror() check replaces testing for error
+ * after each individual fputc or fwrite above.
+ */
+ fputc('E', fpout);
+
+ if (ferror(fpout))
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not write temporary statistics file \"%s\": %m",
+ tmpfile)));
+ FreeFile(fpout);
+ unlink(tmpfile);
+ }
+ else if (FreeFile(fpout) < 0)
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not close temporary statistics file \"%s\": %m",
+ tmpfile)));
+ unlink(tmpfile);
+ }
+ else if (rename(tmpfile, statfile) < 0)
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
+ tmpfile, statfile)));
+ unlink(tmpfile);
+ }
+
+ if (permanent)
+ {
+ get_dbstat_filename(false, false, dbid, tmpfile, MAXPGPATH);
+
+ elog(DEBUG1, "removing temporary stat file '%s'", tmpfile);
+ unlink(tmpfile);
+ }
+ }
+
+
+ /* ----------
+ * pgstat_write_db_dummyfile() -
+ *
+ * All this does is write a dummy stats file for databases that have no
+ * dbentry yet. It writes just a file header - a format ID and a timestamp.
+ * ----------
+ */
+ static void
+ pgstat_write_db_dummyfile(Oid databaseid)
+ {
+ FILE *fpout;
+ int32 format_id;
+ int rc;
+ char tmpfile[MAXPGPATH];
+ char statfile[MAXPGPATH];
+
+ get_dbstat_filename(false, true, databaseid, tmpfile, MAXPGPATH);
+ get_dbstat_filename(false, false, databaseid, statfile, MAXPGPATH);
+
+ elog(DEBUG1, "writing statsfile '%s'", statfile);
+
+ /*
+ * Open the statistics temp file to write out the current values.
+ */
+ fpout = AllocateFile(tmpfile, PG_BINARY_W);
+ if (fpout == NULL)
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not open temporary statistics file \"%s\": %m",
+ tmpfile)));
+ return;
+ }
+
+ /*
+ * Write the file header --- currently just a format ID.
+ */
+ format_id = PGSTAT_FILE_FORMAT_ID;
+ rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
+ (void) rc; /* we'll check for error with ferror */
+
+ /*
+ * Write the timestamp.
+ */
+ rc = fwrite(&(globalStats.stats_timestamp), sizeof(globalStats.stats_timestamp), 1, fpout);
+ (void) rc; /* we'll check for error with ferror */
+
+ /*
+ * No more output to be done. Close the temp file and replace the old
+ * pgstat.stat with it. The ferror() check replaces testing for error
+ * after each individual fputc or fwrite above.
+ */
+ fputc('E', fpout);
+
+ if (ferror(fpout))
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not write temporary dummy statistics file \"%s\": %m",
+ tmpfile)));
+ FreeFile(fpout);
+ unlink(tmpfile);
+ }
+ else if (FreeFile(fpout) < 0)
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not close temporary dummy statistics file \"%s\": %m",
+ tmpfile)));
+ unlink(tmpfile);
+ }
+ else if (rename(tmpfile, statfile) < 0)
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not rename temporary dummy statistics file \"%s\" to \"%s\": %m",
+ tmpfile, statfile)));
+ unlink(tmpfile);
+ }
+ }
/* ----------
* pgstat_read_statsfile() -
*
* Reads in an existing statistics collector file and initializes the
! * databases' hash table. If the permanent file name is requested, also
! * remove it after reading.
! *
! * If a deep read is requested, table/function stats are read also, otherwise
! * the table/function hash tables remain empty.
* ----------
*/
static HTAB *
! pgstat_read_statsfile(Oid onlydb, bool permanent, bool deep)
{
PgStat_StatDBEntry *dbentry;
PgStat_StatDBEntry dbbuf;
HASHCTL hash_ctl;
HTAB *dbhash;
FILE *fpin;
int32 format_id;
bool found;
***************
*** 3690,3697 **** pgstat_read_statsfile(Oid onlydb, bool permanent)
{
/*
* 'D' A PgStat_StatDBEntry struct describing a database
! * follows. Subsequently, zero to many 'T' and 'F' entries
! * will follow until a 'd' is encountered.
*/
case 'D':
if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
--- 3933,3939 ----
{
/*
* 'D' A PgStat_StatDBEntry struct describing a database
! * follows.
*/
case 'D':
if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
***************
*** 3753,3773 **** pgstat_read_statsfile(Oid onlydb, bool permanent)
HASH_ELEM | HASH_FUNCTION | HASH_CONTEXT);
/*
! * Arrange that following records add entries to this
! * database's hash tables.
*/
! tabhash = dbentry->tables;
! funchash = dbentry->functions;
! break;
- /*
- * 'd' End of this database.
- */
- case 'd':
- tabhash = NULL;
- funchash = NULL;
break;
/*
* 'T' A PgStat_StatTabEntry follows.
*/
--- 3995,4111 ----
HASH_ELEM | HASH_FUNCTION | HASH_CONTEXT);
/*
! * If requested, read the data from the database-specific file.
! * If there was onlydb specified (!= InvalidOid), we would not
! * get here because of a break above. So we don't need to
! * recheck.
*/
! if (deep)
! pgstat_read_db_statsfile(dbentry->databaseid,
! dbentry->tables,
! dbentry->functions,
! permanent);
break;
+ case 'E':
+ goto done;
+
+ default:
+ ereport(pgStatRunningInCollector ? LOG : WARNING,
+ (errmsg("corrupted statistics file \"%s\"",
+ statfile)));
+ goto done;
+ }
+ }
+
+ done:
+ FreeFile(fpin);
+
+ if (permanent)
+ {
+ /*
+ * If requested to read the permanent file, also get rid of it; the
+ * in-memory status is now authoritative, and the permanent file would
+ * be out of date in case somebody else reads it.
+ */
+ unlink(PGSTAT_STAT_PERMANENT_FILENAME);
+ }
+
+ return dbhash;
+ }
+
+
+ /* ----------
+ * pgstat_read_db_statsfile() -
+ *
+ * Reads in an existing statistics collector db file and initializes the
+ * tables and functions hash tables (for the database identified by Oid).
+ * ----------
+ */
+ static void
+ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent)
+ {
+ PgStat_StatTabEntry *tabentry;
+ PgStat_StatTabEntry tabbuf;
+ PgStat_StatFuncEntry funcbuf;
+ PgStat_StatFuncEntry *funcentry;
+ FILE *fpin;
+ int32 format_id;
+ TimestampTz timestamp;
+ bool found;
+ char statfile[MAXPGPATH];
+
+ get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
+
+ /*
+ * Try to open the status file. If it doesn't exist, the backends simply
+ * return zero for anything and the collector simply starts from scratch
+ * with empty counters.
+ *
+ * ENOENT is a possibility if the stats collector is not running or has
+ * not yet written the stats file the first time. Any other failure
+ * condition is suspicious.
+ */
+ if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
+ {
+ if (errno != ENOENT)
+ ereport(pgStatRunningInCollector ? LOG : WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not open statistics file \"%s\": %m",
+ statfile)));
+ return;
+ }
+
+ /*
+ * Verify it's of the expected format.
+ */
+ if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id)
+ || format_id != PGSTAT_FILE_FORMAT_ID)
+ {
+ ereport(pgStatRunningInCollector ? LOG : WARNING,
+ (errmsg("corrupted statistics file \"%s\"", statfile)));
+ goto done;
+ }
+
+ /*
+ * Read global stats struct
+ */
+ if (fread(&timestamp, 1, sizeof(timestamp), fpin) != sizeof(timestamp))
+ {
+ ereport(pgStatRunningInCollector ? LOG : WARNING,
+ (errmsg("corrupted statistics file \"%s\"", statfile)));
+ goto done;
+ }
+
+ /*
+ * We found an existing collector stats file. Read it and put all the
+ * hashtable entries into place.
+ */
+ for (;;)
+ {
+ switch (fgetc(fpin))
+ {
/*
* 'T' A PgStat_StatTabEntry follows.
*/
***************
*** 3854,3878 **** done:
FreeFile(fpin);
if (permanent)
! unlink(PGSTAT_STAT_PERMANENT_FILENAME);
! return dbhash;
}
/* ----------
! * pgstat_read_statsfile_timestamp() -
*
! * Attempt to fetch the timestamp of an existing stats file.
* Returns TRUE if successful (timestamp is stored at *ts).
* ----------
*/
static bool
! pgstat_read_statsfile_timestamp(bool permanent, TimestampTz *ts)
{
! PgStat_GlobalStats myGlobalStats;
FILE *fpin;
int32 format_id;
! const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
/*
* Try to open the status file. As above, anything but ENOENT is worthy
--- 4192,4224 ----
FreeFile(fpin);
if (permanent)
! {
! get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
! elog(DEBUG1, "removing permanent stats file '%s'", statfile);
! unlink(statfile);
! }
!
! return;
}
+
/* ----------
! * pgstat_read_db_statsfile_timestamp() -
*
! * Attempt to fetch the timestamp of an existing stats file (for a DB).
* Returns TRUE if successful (timestamp is stored at *ts).
* ----------
*/
static bool
! pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent, TimestampTz *ts)
{
! TimestampTz timestamp;
FILE *fpin;
int32 format_id;
! char statfile[MAXPGPATH];
!
! get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
/*
* Try to open the status file. As above, anything but ENOENT is worthy
***************
*** 3891,3898 **** pgstat_read_statsfile_timestamp(bool permanent, TimestampTz *ts)
/*
* Verify it's of the expected format.
*/
! if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id)
! || format_id != PGSTAT_FILE_FORMAT_ID)
{
ereport(pgStatRunningInCollector ? LOG : WARNING,
(errmsg("corrupted statistics file \"%s\"", statfile)));
--- 4237,4244 ----
/*
* Verify it's of the expected format.
*/
! if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
! format_id != PGSTAT_FILE_FORMAT_ID)
{
ereport(pgStatRunningInCollector ? LOG : WARNING,
(errmsg("corrupted statistics file \"%s\"", statfile)));
***************
*** 3903,3909 **** pgstat_read_statsfile_timestamp(bool permanent, TimestampTz *ts)
/*
* Read global stats struct
*/
! if (fread(&myGlobalStats, 1, sizeof(myGlobalStats), fpin) != sizeof(myGlobalStats))
{
ereport(pgStatRunningInCollector ? LOG : WARNING,
(errmsg("corrupted statistics file \"%s\"", statfile)));
--- 4249,4255 ----
/*
* Read global stats struct
*/
! if (fread(&timestamp, 1, sizeof(TimestampTz), fpin) != sizeof(TimestampTz))
{
ereport(pgStatRunningInCollector ? LOG : WARNING,
(errmsg("corrupted statistics file \"%s\"", statfile)));
***************
*** 3911,3917 **** pgstat_read_statsfile_timestamp(bool permanent, TimestampTz *ts)
return false;
}
! *ts = myGlobalStats.stats_timestamp;
FreeFile(fpin);
return true;
--- 4257,4263 ----
return false;
}
! *ts = timestamp;
FreeFile(fpin);
return true;
***************
*** 3947,3953 **** backend_read_statsfile(void)
CHECK_FOR_INTERRUPTS();
! ok = pgstat_read_statsfile_timestamp(false, &file_ts);
cur_ts = GetCurrentTimestamp();
/* Calculate min acceptable timestamp, if we didn't already */
--- 4293,4299 ----
CHECK_FOR_INTERRUPTS();
! ok = pgstat_read_db_statsfile_timestamp(MyDatabaseId, false, &file_ts);
cur_ts = GetCurrentTimestamp();
/* Calculate min acceptable timestamp, if we didn't already */
***************
*** 4006,4012 **** backend_read_statsfile(void)
pfree(mytime);
}
! pgstat_send_inquiry(cur_ts, min_ts);
break;
}
--- 4352,4358 ----
pfree(mytime);
}
! pgstat_send_inquiry(cur_ts, min_ts, MyDatabaseId);
break;
}
***************
*** 4016,4022 **** backend_read_statsfile(void)
/* Not there or too old, so kick the collector and wait a bit */
if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
! pgstat_send_inquiry(cur_ts, min_ts);
pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
}
--- 4362,4368 ----
/* Not there or too old, so kick the collector and wait a bit */
if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
! pgstat_send_inquiry(cur_ts, min_ts, MyDatabaseId);
pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
}
***************
*** 4024,4034 **** backend_read_statsfile(void)
if (count >= PGSTAT_POLL_LOOP_COUNT)
elog(WARNING, "pgstat wait timeout");
! /* Autovacuum launcher wants stats about all databases */
if (IsAutoVacuumLauncherProcess())
! pgStatDBHash = pgstat_read_statsfile(InvalidOid, false);
else
! pgStatDBHash = pgstat_read_statsfile(MyDatabaseId, false);
}
--- 4370,4383 ----
if (count >= PGSTAT_POLL_LOOP_COUNT)
elog(WARNING, "pgstat wait timeout");
! /*
! * Autovacuum launcher wants stats about all databases, but a shallow
! * read is sufficient.
! */
if (IsAutoVacuumLauncherProcess())
! pgStatDBHash = pgstat_read_statsfile(InvalidOid, false, false);
else
! pgStatDBHash = pgstat_read_statsfile(MyDatabaseId, false, true);
}
***************
*** 4084,4109 **** pgstat_clear_snapshot(void)
static void
pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
{
/*
! * Advance last_statrequest if this requestor has a newer cutoff time
! * than any previous request.
*/
! if (msg->cutoff_time > last_statrequest)
! last_statrequest = msg->cutoff_time;
/*
! * If the requestor's local clock time is older than last_statwrite, we
* should suspect a clock glitch, ie system time going backwards; though
* the more likely explanation is just delayed message receipt. It is
* worth expending a GetCurrentTimestamp call to be sure, since a large
* retreat in the system clock reading could otherwise cause us to neglect
* to update the stats file for a long time.
*/
! if (msg->clock_time < last_statwrite)
{
TimestampTz cur_ts = GetCurrentTimestamp();
! if (cur_ts < last_statwrite)
{
/*
* Sure enough, time went backwards. Force a new stats file write
--- 4433,4485 ----
static void
pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
{
+ slist_iter iter;
+ bool found = false;
+ DBWriteRequest *newreq;
+ PgStat_StatDBEntry *dbentry;
+
+ elog(DEBUG1, "received inquiry for %d", msg->databaseid);
+
+ /*
+ * Find the last write request for this DB (found=true in that case). Plain
+ * linear search, not really worth doing any magic here (probably).
+ */
+ slist_foreach(iter, &last_statrequests)
+ {
+ DBWriteRequest *req = slist_container(DBWriteRequest, next, iter.cur);
+
+ if (req->databaseid != msg->databaseid)
+ continue;
+
+ if (msg->cutoff_time > req->request_time)
+ req->request_time = msg->cutoff_time;
+ found = true;
+ return;
+ }
+
/*
! * There's no request for this DB yet, so create one.
*/
! newreq = palloc(sizeof(DBWriteRequest));
!
! newreq->databaseid = msg->databaseid;
! newreq->request_time = msg->clock_time;
! slist_push_head(&last_statrequests, &newreq->next);
/*
! * If the requestor's local clock time is older than stats_timestamp, we
* should suspect a clock glitch, ie system time going backwards; though
* the more likely explanation is just delayed message receipt. It is
* worth expending a GetCurrentTimestamp call to be sure, since a large
* retreat in the system clock reading could otherwise cause us to neglect
* to update the stats file for a long time.
*/
! dbentry = pgstat_get_db_entry(msg->databaseid, false);
! if ((dbentry != NULL) && (msg->clock_time < dbentry->stats_timestamp))
{
TimestampTz cur_ts = GetCurrentTimestamp();
! if (cur_ts < dbentry->stats_timestamp)
{
/*
* Sure enough, time went backwards. Force a new stats file write
***************
*** 4113,4127 **** pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
char *mytime;
/* Copy because timestamptz_to_str returns a static buffer */
! writetime = pstrdup(timestamptz_to_str(last_statwrite));
mytime = pstrdup(timestamptz_to_str(cur_ts));
! elog(LOG, "last_statwrite %s is later than collector's time %s",
! writetime, mytime);
pfree(writetime);
pfree(mytime);
! last_statrequest = cur_ts;
! last_statwrite = last_statrequest - 1;
}
}
}
--- 4489,4504 ----
char *mytime;
/* Copy because timestamptz_to_str returns a static buffer */
! writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
mytime = pstrdup(timestamptz_to_str(cur_ts));
! elog(LOG,
! "stats_timestamp %s is later than collector's time %s for db %d",
! writetime, mytime, dbentry->databaseid);
pfree(writetime);
pfree(mytime);
! newreq->request_time = cur_ts;
! dbentry->stats_timestamp = cur_ts - 1;
}
}
}
***************
*** 4270,4298 **** pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
static void
pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
{
PgStat_StatDBEntry *dbentry;
/*
* Lookup the database in the hashtable.
*/
! dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
/*
! * If found, remove it.
*/
if (dbentry)
{
if (dbentry->tables != NULL)
hash_destroy(dbentry->tables);
if (dbentry->functions != NULL)
hash_destroy(dbentry->functions);
if (hash_search(pgStatDBHash,
! (void *) &(dbentry->databaseid),
HASH_REMOVE, NULL) == NULL)
ereport(ERROR,
! (errmsg("database hash table corrupted "
! "during cleanup --- abort")));
}
}
--- 4647,4682 ----
static void
pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
{
+ Oid dbid = msg->m_databaseid;
PgStat_StatDBEntry *dbentry;
/*
* Lookup the database in the hashtable.
*/
! dbentry = pgstat_get_db_entry(dbid, false);
/*
! * If found, remove it (along with the db statfile).
*/
if (dbentry)
{
+ char statfile[MAXPGPATH];
+
+ get_dbstat_filename(true, false, dbid, statfile, MAXPGPATH);
+
+ elog(DEBUG1, "removing %s", statfile);
+ unlink(statfile);
+
if (dbentry->tables != NULL)
hash_destroy(dbentry->tables);
if (dbentry->functions != NULL)
hash_destroy(dbentry->functions);
if (hash_search(pgStatDBHash,
! (void *) &dbid,
HASH_REMOVE, NULL) == NULL)
ereport(ERROR,
! (errmsg("database hash table corrupted during cleanup --- abort")));
}
}
***************
*** 4687,4689 **** pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
--- 5071,5131 ----
HASH_REMOVE, NULL);
}
}
+
+ /* ----------
+ * pgstat_write_statsfile_needed() -
+ *
+ * Checks whether there's a db stats request, requiring a file write.
+ *
+ * TODO Seems that thanks the way we handle last_statrequests (erase after
+ * a write), this is unnecessary. Just check that there's at least one
+ * request and you're done. Although there might be delayed requests ...
+ * ----------
+ */
+ static bool
+ pgstat_write_statsfile_needed(void)
+ {
+ PgStat_StatDBEntry *dbentry;
+ slist_iter iter;
+
+ /* Check the databases if they need to refresh the stats. */
+ slist_foreach(iter, &last_statrequests)
+ {
+ DBWriteRequest *req = slist_container(DBWriteRequest, next, iter.cur);
+
+ dbentry = pgstat_get_db_entry(req->databaseid, false);
+
+ /* No dbentry yet or too old. */
+ if (!dbentry || (dbentry->stats_timestamp < req->request_time))
+ {
+ return true;
+ }
+ }
+
+ /* Well, everything was written recently ... */
+ return false;
+ }
+
+ /* ----------
+ * pgstat_db_requested() -
+ *
+ * Checks whether stats for a particular DB need to be written to a file.
+ * ----------
+ */
+
+ static bool
+ pgstat_db_requested(Oid databaseid)
+ {
+ slist_iter iter;
+
+ /* Check the databases if they need to refresh the stats. */
+ slist_foreach(iter, &last_statrequests)
+ {
+ DBWriteRequest *req = slist_container(DBWriteRequest, next, iter.cur);
+
+ if (req->databaseid == databaseid)
+ return true;
+ }
+
+ return false;
+ }
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 8704,8717 **** static void
assign_pgstat_temp_directory(const char *newval, void *extra)
{
/* check_canonical_path already canonicalized newval for us */
char *tname;
char *fname;
! tname = guc_malloc(ERROR, strlen(newval) + 12); /* /pgstat.tmp */
! sprintf(tname, "%s/pgstat.tmp", newval);
! fname = guc_malloc(ERROR, strlen(newval) + 13); /* /pgstat.stat */
! sprintf(fname, "%s/pgstat.stat", newval);
if (pgstat_stat_tmpname)
free(pgstat_stat_tmpname);
pgstat_stat_tmpname = tname;
--- 8704,8728 ----
assign_pgstat_temp_directory(const char *newval, void *extra)
{
/* check_canonical_path already canonicalized newval for us */
+ char *dname;
char *tname;
char *fname;
! /* directory */
! dname = guc_malloc(ERROR, strlen(newval) + 1); /* runtime dir */
! sprintf(dname, "%s", newval);
+ /* global stats */
+ tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
+ sprintf(tname, "%s/global.tmp", newval);
+ fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
+ sprintf(fname, "%s/global.stat", newval);
+
+ if (pgstat_stat_directory)
+ free(pgstat_stat_directory);
+ pgstat_stat_directory = dname;
+ /* invalidate cached length in pgstat.c */
+ pgstat_stat_dbfile_maxlen = 0;
if (pgstat_stat_tmpname)
free(pgstat_stat_tmpname);
pgstat_stat_tmpname = tname;
*** a/src/bin/initdb/initdb.c
--- b/src/bin/initdb/initdb.c
***************
*** 192,197 **** const char *subdirs[] = {
--- 192,198 ----
"base",
"base/1",
"pg_tblspc",
+ "pg_stat",
"pg_stat_tmp"
};
*** a/src/include/pgstat.h
--- b/src/include/pgstat.h
***************
*** 205,210 **** typedef struct PgStat_MsgInquiry
--- 205,211 ----
PgStat_MsgHdr m_hdr;
TimestampTz clock_time; /* observed local clock time */
TimestampTz cutoff_time; /* minimum acceptable file timestamp */
+ Oid databaseid; /* requested DB (InvalidOid => all DBs) */
} PgStat_MsgInquiry;
***************
*** 514,520 **** typedef union PgStat_Msg
* ------------------------------------------------------------
*/
! #define PGSTAT_FILE_FORMAT_ID 0x01A5BC9A
/* ----------
* PgStat_StatDBEntry The collector's data per database
--- 515,521 ----
* ------------------------------------------------------------
*/
! #define PGSTAT_FILE_FORMAT_ID 0xA240CA47
/* ----------
* PgStat_StatDBEntry The collector's data per database
***************
*** 545,550 **** typedef struct PgStat_StatDBEntry
--- 546,552 ----
PgStat_Counter n_block_write_time;
TimestampTz stat_reset_timestamp;
+ TimestampTz stats_timestamp; /* time of db stats file update */
/*
* tables and functions must be last in the struct, because we don't write
***************
*** 722,727 **** extern bool pgstat_track_activities;
--- 724,731 ----
extern bool pgstat_track_counts;
extern int pgstat_track_functions;
extern PGDLLIMPORT int pgstat_track_activity_query_size;
+ extern char *pgstat_stat_directory;
+ extern int pgstat_stat_dbfile_maxlen;
extern char *pgstat_stat_tmpname;
extern char *pgstat_stat_filename;
Alvaro Herrera wrote:
Hm, and I now also realize another bug in this patch: the global stats
only include database entries for requested databases; but perhaps the
existing files can serve later requestors just fine for databases that
already had files; so the global stats file should continue to carry
entries for them, with the old timestamps.
Actually the code already does things that way -- apologies.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
I saw discussion about this on this thread, but I'm not able to figure
out what the answer is: how does this work with moving the stats file,
for example to a RAMdisk? Specifically, if the user sets
stats_temp_directory, does it continue to work the way it does now?
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Josh Berkus wrote:
I saw discussion about this on this thread, but I'm not able to figure
out what the answer is: how does this work with moving the stats file,
for example to a RAMdisk? Specifically, if the user sets
stats_temp_directory, does it continue to work the way it does now?
Of course. You get more files than previously, but yes.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Alvaro Herrera wrote:
Here's a ninth version of this patch. (version 8 went unpublished). I
have simplified a lot of things and improved some comments; I think I
understand much of it now. I think this patch is fairly close to
committable, but one issue remains, which is this bit in
pgstat_write_statsfiles():

/* In any case, we can just throw away all the db requests, but we need to
* write dummy files for databases without a stat entry (it would cause
* issues in pgstat_read_db_statsfile_timestamp and pgstat wait timeouts).
* This may happen e.g. for shared DB (oid = 0) right after initdb.
*/
I think the real way to handle this is to fix backend_read_statsfile().
It's using the old logic of considering existance of the file, but of
course now the file might not exist at all and that doesn't mean we need
to continue kicking the collector to write it. We need a mechanism to
figure that the collector is just not going to write the file no matter
how hard we kick it ...
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Alvaro Herrera wrote:
Here's a ninth version of this patch. (version 8 went unpublished). I
have simplified a lot of things and improved some comments; I think I
understand much of it now. I think this patch is fairly close to
committable, but one issue remains, which is this bit in
pgstat_write_statsfiles():
I've marked this as Waiting on author for the time being. I'm going to
review/work on other patches now, hoping that Tomas will post an updated
version in time for it to be considered for 9.3.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 14.2.2013 20:43, Josh Berkus wrote:
I saw discussion about this on this thread, but I'm not able to figure
out what the answer is: how does this work with moving the stats file,
for example to a RAMdisk? Specifically, if the user sets
stats_temp_directory, does it continue to work the way it does now?
No change in this respect - you can still use RAMdisk, and you'll
actually need less space because the space requirements decreased due to
breaking the single file into multiple pieces.
We're using it this way (on a tmpfs filesystem) and it works like a charm.
regards
Tomas
First of all, big thanks for working on this patch and not only
identifying the issues but actually fixing them.
On 14.2.2013 20:23, Alvaro Herrera wrote:
Here's a ninth version of this patch. (version 8 went unpublished). I
have simplified a lot of things and improved some comments; I think I
understand much of it now. I think this patch is fairly close to
committable, but one issue remains, which is this bit in
pgstat_write_statsfiles():
...
The problem here is that creating these dummy entries will cause a
difference in autovacuum behavior. Autovacuum will skip processing
databases with no pgstat entry, and the intended reason is that if
there's no pgstat entry it's because the database doesn't have enough
activity. Now perhaps we want to change that, but it should be an
explicit decision taken after discussion and thought, not side effect
from an unrelated patch.
I don't see how that changes the autovacuum behavior. Can you explain
that a bit more?
As I see it, with the old (single-file version) the autovacuum worker
would get exactly the same thing, i.e. no stats at all.
Which is exactly what the autovacuum worker gets with the new code, except
that the check for last statfile timestamp uses the "per-db" file, so we
need to write it. This way the worker is able to read the timestamp, is
happy about it because it gets a fresh file although it gets no stats later.
Where is the behavior change? Can you provide an example?
kind regards
Tomas
On 14.2.2013 22:24, Alvaro Herrera wrote:
Alvaro Herrera wrote:
Here's a ninth version of this patch. (version 8 went unpublished). I
have simplified a lot of things and improved some comments; I think I
understand much of it now. I think this patch is fairly close to
committable, but one issue remains, which is this bit in
pgstat_write_statsfiles():

I've marked this as Waiting on author for the time being. I'm going to
review/work on other patches now, hoping that Tomas will post an updated
version in time for it to be considered for 9.3.
Sadly I have no idea how to fix that, and I think the solution you
suggested in the previous messages does not actually do the trick :-(
Tomas
Tomas Vondra wrote:
I don't see how that changes the autovacuum behavior. Can you explain
that a bit more?
It might be that I'm all wet on this. I'll poke at it some more.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Tomas Vondra wrote:
On 14.2.2013 20:23, Alvaro Herrera wrote:
The problem here is that creating these dummy entries will cause a
difference in autovacuum behavior. Autovacuum will skip processing
databases with no pgstat entry, and the intended reason is that if
there's no pgstat entry it's because the database doesn't have enough
activity. Now perhaps we want to change that, but it should be an
explicit decision taken after discussion and thought, not side effect
from an unrelated patch.

I don't see how that changes the autovacuum behavior. Can you explain
that a bit more?

As I see it, with the old (single-file version) the autovacuum worker
would get exactly the same thing, i.e. no stats at all.
See in autovacuum.c the calls to pgstat_fetch_stat_dbentry(). Most of
them check for NULL result and act differently depending on that.
Returning a valid (not NULL) entry full of zeroes is not the same.
I didn't actually try to reproduce a problem.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 15.2.2013 16:38, Alvaro Herrera wrote:
Tomas Vondra wrote:
On 14.2.2013 20:23, Alvaro Herrera wrote:
The problem here is that creating these dummy entries will cause a
difference in autovacuum behavior. Autovacuum will skip processing
databases with no pgstat entry, and the intended reason is that if
there's no pgstat entry it's because the database doesn't have enough
activity. Now perhaps we want to change that, but it should be an
explicit decision taken after discussion and thought, not side effect
from an unrelated patch.

I don't see how that changes the autovacuum behavior. Can you explain
that a bit more?

As I see it, with the old (single-file version) the autovacuum worker
would get exactly the same thing, i.e. no stats at all.

See in autovacuum.c the calls to pgstat_fetch_stat_dbentry(). Most of
them check for NULL result and act differently depending on that.
Returning a valid (not NULL) entry full of zeroes is not the same.
I didn't actually try to reproduce a problem.
Errrr, but why would the patched code return entry full of zeroes and
not NULL as before? The dummy files serve single purpose - confirm that
the collector attempted to write info for the particular database (and
did not find any data for that).
All it contains is a timestamp of the write - nothing else. So the
worker will read the global file (containing list of stats for dbs) and
then will get NULL just like the old code. Because the database is not
there and the patch does not change that at all.
Tomas
On 15.2.2013 01:02, Tomas Vondra wrote:
On 14.2.2013 22:24, Alvaro Herrera wrote:
Alvaro Herrera wrote:
Here's a ninth version of this patch. (version 8 went unpublished). I
have simplified a lot of things and improved some comments; I think I
understand much of it now. I think this patch is fairly close to
committable, but one issue remains, which is this bit in
pgstat_write_statsfiles():

I've marked this as Waiting on author for the time being. I'm going to
review/work on other patches now, hoping that Tomas will post an updated
version in time for it to be considered for 9.3.

Sadly I have no idea how to fix that, and I think the solution you
suggested in the previous messages does not actually do the trick :-(
I've been thinking about this (actually I had a really weird dream about
it this night) and I think it might work like this:
(1) check the timestamp of the global file -> if it's too old, we need
to send an inquiry or wait a bit longer
(2) if it's new enough, we need to read it and look for that particular
database - if it's not found, we have no info about it yet (this is
the case handled by the dummy files)
(3) if there's a database stat entry, we need to check the timestamp
when it was written for the last time -> if it's too old, send an
inquiry and wait a bit longer
(4) well, we have a recent global file, it contains the database stat
entry and it's fresh enough -> tadaaaaaa, we're done
At least that's my idea - I haven't tried to implement it yet.
I see a few pros and cons of this approach:
pros:
* no dummy files
* no timestamps in the per-db files (and thus no sync issues)
cons:
* the backends / workers will have to re-read the global file just to
check that the per-db file is there and is fresh enough
So far it was sufficient just to peek at the timestamp at the beginning
of the per-db stat file - minimum data read, no CPU-expensive processing
etc. Sadly the more DBs there are, the larger the file gets (thus more
overhead to read it).
OTOH it's not that much data (~180 B per entry, so with 1000 DBs
it's just ~180kB) so I don't expect this to be a tremendous issue. And
the pros seem to be quite compelling.
Tomas
Tomas Vondra wrote:
I've been thinking about this (actually I had a really weird dream about
it this night) and I think it might work like this:

(1) check the timestamp of the global file -> if it's too old, we need
to send an inquiry or wait a bit longer

(2) if it's new enough, we need to read it and look for that particular
database - if it's not found, we have no info about it yet (this is
the case handled by the dummy files)

(3) if there's a database stat entry, we need to check the timestamp
when it was written for the last time -> if it's too old, send an
inquiry and wait a bit longer

(4) well, we have a recent global file, it contains the database stat
entry and it's fresh enough -> tadaaaaaa, we're done
Hmm, yes, I think this is what I was imagining. I had even considered
that the timestamp would be removed from the per-db file as you suggest
here.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 17.2.2013 06:46, Alvaro Herrera wrote:
Tomas Vondra wrote:
I've been thinking about this (actually I had a really weird dream about
it this night) and I think it might work like this:

(1) check the timestamp of the global file -> if it's too old, we need
to send an inquiry or wait a bit longer

(2) if it's new enough, we need to read it and look for that particular
database - if it's not found, we have no info about it yet (this is
the case handled by the dummy files)

(3) if there's a database stat entry, we need to check the timestamp
when it was written for the last time -> if it's too old, send an
inquiry and wait a bit longer

(4) well, we have a recent global file, it contains the database stat
entry and it's fresh enough -> tadaaaaaa, we're done

Hmm, yes, I think this is what I was imagining. I had even considered
that the timestamp would be removed from the per-db file as you suggest
here.
So, here's v10 of the patch (based on the v9+v9a), that implements the
approach described above.
It turned out to be much easier than I expected (basically just a
rewrite of the pgstat_read_db_statsfile_timestamp() function).
I've done a fair amount of testing (and will do some more next week) but
it seems to work just fine - no errors, no measurable decrease of
performance etc.
regards
Tomas Vondra
Attachments:
stats-split-v10.patch (text/x-diff)
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 9b92ebb..36c0d8b 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -38,6 +38,7 @@
#include "access/xact.h"
#include "catalog/pg_database.h"
#include "catalog/pg_proc.h"
+#include "lib/ilist.h"
#include "libpq/ip.h"
#include "libpq/libpq.h"
#include "libpq/pqsignal.h"
@@ -66,8 +67,9 @@
* Paths for the statistics files (relative to installation's $PGDATA).
* ----------
*/
-#define PGSTAT_STAT_PERMANENT_FILENAME "global/pgstat.stat"
-#define PGSTAT_STAT_PERMANENT_TMPFILE "global/pgstat.tmp"
+#define PGSTAT_STAT_PERMANENT_DIRECTORY "pg_stat"
+#define PGSTAT_STAT_PERMANENT_FILENAME "pg_stat/global.stat"
+#define PGSTAT_STAT_PERMANENT_TMPFILE "pg_stat/global.tmp"
/* ----------
* Timer definitions.
@@ -115,6 +117,8 @@ int pgstat_track_activity_query_size = 1024;
* Built from GUC parameter
* ----------
*/
+char *pgstat_stat_directory = NULL;
+int pgstat_stat_dbfile_maxlen = 0;
char *pgstat_stat_filename = NULL;
char *pgstat_stat_tmpname = NULL;
@@ -219,11 +223,16 @@ static int localNumBackends = 0;
*/
static PgStat_GlobalStats globalStats;
-/* Last time the collector successfully wrote the stats file */
-static TimestampTz last_statwrite;
+/* Write request info for each database */
+typedef struct DBWriteRequest
+{
+ Oid databaseid; /* OID of the database to write */
+ TimestampTz request_time; /* timestamp of the last write request */
+ slist_node next;
+} DBWriteRequest;
-/* Latest statistics request time from backends */
-static TimestampTz last_statrequest;
+/* Latest statistics request times from backends */
+static slist_head last_statrequests = SLIST_STATIC_INIT(last_statrequests);
static volatile bool need_exit = false;
static volatile bool got_SIGHUP = false;
@@ -252,11 +261,16 @@ static void pgstat_sighup_handler(SIGNAL_ARGS);
static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
Oid tableoid, bool create);
-static void pgstat_write_statsfile(bool permanent);
-static HTAB *pgstat_read_statsfile(Oid onlydb, bool permanent);
+static void pgstat_write_statsfiles(bool permanent, bool allDbs);
+static void pgstat_write_db_statsfile(PgStat_StatDBEntry * dbentry, bool permanent);
+static HTAB *pgstat_read_statsfile(Oid onlydb, bool permanent, bool deep);
+static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
static void backend_read_statsfile(void);
static void pgstat_read_current_status(void);
+static bool pgstat_write_statsfile_needed(void);
+static bool pgstat_db_requested(Oid databaseid);
+
static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
static void pgstat_send_funcstats(void);
static HTAB *pgstat_collect_oids(Oid catalogid);
@@ -285,7 +299,6 @@ static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int le
static void pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len);
static void pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len);
-
/* ------------------------------------------------------------
* Public functions called from postmaster follow
* ------------------------------------------------------------
@@ -541,16 +554,40 @@ startup_failed:
}
/*
+ * subroutine for pgstat_reset_all
+ */
+static void
+pgstat_reset_remove_files(const char *directory)
+{
+ DIR * dir;
+ struct dirent * entry;
+ char fname[MAXPGPATH];
+
+ dir = AllocateDir(pgstat_stat_directory);
+ while ((entry = ReadDir(dir, pgstat_stat_directory)) != NULL)
+ {
+ if (strcmp(entry->d_name, ".") == 0 || strcmp(entry->d_name, "..") == 0)
+ continue;
+
+ snprintf(fname, MAXPGPATH, "%s/%s", pgstat_stat_directory,
+ entry->d_name);
+ unlink(fname);
+ }
+ FreeDir(dir);
+}
+
+/*
* pgstat_reset_all() -
*
- * Remove the stats file. This is currently used only if WAL
+ * Remove the stats files. This is currently used only if WAL
* recovery is needed after a crash.
*/
void
pgstat_reset_all(void)
{
- unlink(pgstat_stat_filename);
- unlink(PGSTAT_STAT_PERMANENT_FILENAME);
+
+ pgstat_reset_remove_files(pgstat_stat_directory);
+ pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
}
#ifdef EXEC_BACKEND
@@ -1408,13 +1445,14 @@ pgstat_ping(void)
* ----------
*/
static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time)
+pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
{
PgStat_MsgInquiry msg;
pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
msg.clock_time = clock_time;
msg.cutoff_time = cutoff_time;
+ msg.databaseid = databaseid;
pgstat_send(&msg, sizeof(msg));
}
@@ -3004,6 +3042,7 @@ PgstatCollectorMain(int argc, char *argv[])
int len;
PgStat_Msg msg;
int wr;
+ bool first_write = true;
IsUnderPostmaster = true; /* we are a postmaster subprocess now */
@@ -3053,17 +3092,11 @@ PgstatCollectorMain(int argc, char *argv[])
init_ps_display("stats collector process", "", "", "");
/*
- * Arrange to write the initial status file right away
- */
- last_statrequest = GetCurrentTimestamp();
- last_statwrite = last_statrequest - 1;
-
- /*
* Read in an existing statistics stats file or initialize the stats to
* zero.
*/
pgStatRunningInCollector = true;
- pgStatDBHash = pgstat_read_statsfile(InvalidOid, true);
+ pgStatDBHash = pgstat_read_statsfile(InvalidOid, true, true);
/*
* Loop to process messages until we get SIGQUIT or detect ungraceful
@@ -3107,10 +3140,14 @@ PgstatCollectorMain(int argc, char *argv[])
/*
* Write the stats file if a new request has arrived that is not
- * satisfied by existing file.
+ * satisfied by existing file (force writing all files if it's
+ * the first write after startup).
*/
- if (last_statwrite < last_statrequest)
- pgstat_write_statsfile(false);
+ if (first_write || pgstat_write_statsfile_needed())
+ {
+ pgstat_write_statsfiles(false, first_write);
+ first_write = false;
+ }
/*
* Try to receive and process a message. This will not block,
@@ -3269,7 +3306,7 @@ PgstatCollectorMain(int argc, char *argv[])
/*
* Save the final stats to reuse at next startup.
*/
- pgstat_write_statsfile(true);
+ pgstat_write_statsfiles(true, true);
exit(0);
}
@@ -3349,6 +3386,7 @@ pgstat_get_db_entry(Oid databaseid, bool create)
result->n_block_write_time = 0;
result->stat_reset_timestamp = GetCurrentTimestamp();
+ result->stats_timestamp = 0;
memset(&hash_ctl, 0, sizeof(hash_ctl));
hash_ctl.keysize = sizeof(Oid);
@@ -3422,30 +3460,32 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
/* ----------
- * pgstat_write_statsfile() -
+ * pgstat_write_statsfiles() -
*
* Tell the news.
- * If writing to the permanent file (happens when the collector is
- * shutting down only), remove the temporary file so that backends
+ * If writing to the permanent files (happens when the collector is
+ * shutting down only), remove the temporary files so that backends
* starting up under a new postmaster can't read the old data before
* the new collector is ready.
+ *
+ * When 'allDbs' is false, only the requested databases (listed in
+ * last_statrequests) will be written; otherwise, all databases will be
+ * written.
* ----------
*/
static void
-pgstat_write_statsfile(bool permanent)
+pgstat_write_statsfiles(bool permanent, bool allDbs)
{
HASH_SEQ_STATUS hstat;
- HASH_SEQ_STATUS tstat;
- HASH_SEQ_STATUS fstat;
PgStat_StatDBEntry *dbentry;
- PgStat_StatTabEntry *tabentry;
- PgStat_StatFuncEntry *funcentry;
FILE *fpout;
int32 format_id;
const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
int rc;
+ elog(DEBUG1, "writing statsfile '%s'", statfile);
+
/*
* Open the statistics temp file to write out the current values.
*/
@@ -3484,40 +3524,26 @@ pgstat_write_statsfile(bool permanent)
while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
{
/*
- * Write out the DB entry including the number of live backends. We
- * don't write the tables or functions pointers, since they're of no
- * use to any other process.
+ * Write out the tables and functions into a separate file, if
+ * required.
+ *
+ * We need to do this before the dbentry write, to ensure the
+ * timestamps written to both are consistent.
*/
- fputc('D', fpout);
- rc = fwrite(dbentry, offsetof(PgStat_StatDBEntry, tables), 1, fpout);
- (void) rc; /* we'll check for error with ferror */
-
- /*
- * Walk through the database's access stats per table.
- */
- hash_seq_init(&tstat, dbentry->tables);
- while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
- {
- fputc('T', fpout);
- rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
- (void) rc; /* we'll check for error with ferror */
- }
-
- /*
- * Walk through the database's function stats table.
- */
- hash_seq_init(&fstat, dbentry->functions);
- while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
+ if (allDbs || pgstat_db_requested(dbentry->databaseid))
{
- fputc('F', fpout);
- rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
- (void) rc; /* we'll check for error with ferror */
+ elog(DEBUG1, "writing statsfile for DB %d", dbentry->databaseid);
+ dbentry->stats_timestamp = globalStats.stats_timestamp;
+ pgstat_write_db_statsfile(dbentry, permanent);
}
/*
- * Mark the end of this DB
+ * Write out the DB entry. We don't write the tables or functions
+ * pointers, since they're of no use to any other process.
*/
- fputc('d', fpout);
+ fputc('D', fpout);
+ rc = fwrite(dbentry, offsetof(PgStat_StatDBEntry, tables), 1, fpout);
+ (void) rc; /* we'll check for error with ferror */
}
/*
@@ -3527,6 +3553,25 @@ pgstat_write_statsfile(bool permanent)
*/
fputc('E', fpout);
+ /*
+ * Now throw away the list of requests. Note that requests sent after we
+ * started the write are still waiting on the network socket.
+ */
+ if (!slist_is_empty(&last_statrequests))
+ {
+ slist_mutable_iter iter;
+
+ slist_foreach_modify(iter, &last_statrequests)
+ {
+ DBWriteRequest *req = slist_container(DBWriteRequest, next,
+ iter.cur);
+
+ pfree(req);
+ }
+
+ slist_init(&last_statrequests);
+ }
+
if (ferror(fpout))
{
ereport(LOG,
@@ -3552,61 +3597,161 @@ pgstat_write_statsfile(bool permanent)
tmpfile, statfile)));
unlink(tmpfile);
}
- else
+
+ if (permanent)
+ unlink(pgstat_stat_filename);
+}
+
+/*
+ * return the filename for a DB stat file; filename is the output buffer,
+ * of length len.
+ */
+static void
+get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
+ char *filename, int len)
+{
+ int printed;
+
+ printed = snprintf(filename, len, "%s/db_%u.%s",
+ permanent ? "pg_stat" : pgstat_stat_directory,
+ databaseid,
+ tempname ? "tmp" : "stat");
+ if (printed > len)
+ elog(ERROR, "overlength pgstat path");
+}
+
+/* ----------
+ * pgstat_write_db_statsfile() -
+ *
+ * Tell the news. This writes stats file for a single database.
+ *
+ * If writing to the permanent file (happens when the collector is
+ * shutting down only), remove the temporary file so that backends
+ * starting up under a new postmaster can't read the old data before
+ * the new collector is ready.
+ * ----------
+ */
+static void
+pgstat_write_db_statsfile(PgStat_StatDBEntry * dbentry, bool permanent)
+{
+ HASH_SEQ_STATUS tstat;
+ HASH_SEQ_STATUS fstat;
+ PgStat_StatTabEntry *tabentry;
+ PgStat_StatFuncEntry *funcentry;
+ FILE *fpout;
+ int32 format_id;
+ Oid dbid = dbentry->databaseid;
+ int rc;
+ char tmpfile[MAXPGPATH];
+ char statfile[MAXPGPATH];
+
+ get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
+ get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
+
+ elog(DEBUG1, "writing statsfile '%s'", statfile);
+
+ /*
+ * Open the statistics temp file to write out the current values.
+ */
+ fpout = AllocateFile(tmpfile, PG_BINARY_W);
+ if (fpout == NULL)
{
- /*
- * Successful write, so update last_statwrite.
- */
- last_statwrite = globalStats.stats_timestamp;
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not open temporary statistics file \"%s\": %m",
+ tmpfile)));
+ return;
+ }
- /*
- * If there is clock skew between backends and the collector, we could
- * receive a stats request time that's in the future. If so, complain
- * and reset last_statrequest. Resetting ensures that no inquiry
- * message can cause more than one stats file write to occur.
- */
- if (last_statrequest > last_statwrite)
- {
- char *reqtime;
- char *mytime;
+ /*
+ * Write the file header --- currently just a format ID.
+ */
+ format_id = PGSTAT_FILE_FORMAT_ID;
+ rc = fwrite(&format_id, sizeof(format_id), 1, fpout);
+ (void) rc; /* we'll check for error with ferror */
- /* Copy because timestamptz_to_str returns a static buffer */
- reqtime = pstrdup(timestamptz_to_str(last_statrequest));
- mytime = pstrdup(timestamptz_to_str(last_statwrite));
- elog(LOG, "last_statrequest %s is later than collector's time %s",
- reqtime, mytime);
- pfree(reqtime);
- pfree(mytime);
+ /*
+ * Walk through the database's access stats per table.
+ */
+ hash_seq_init(&tstat, dbentry->tables);
+ while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
+ {
+ fputc('T', fpout);
+ rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
+ (void) rc; /* we'll check for error with ferror */
+ }
- last_statrequest = last_statwrite;
- }
+ /*
+ * Walk through the database's function stats table.
+ */
+ hash_seq_init(&fstat, dbentry->functions);
+ while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
+ {
+ fputc('F', fpout);
+ rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
+ (void) rc; /* we'll check for error with ferror */
+ }
+
+ /*
+ * No more output to be done. Close the temp file and replace the old
+ * pgstat.stat with it. The ferror() check replaces testing for error
+ * after each individual fputc or fwrite above.
+ */
+ fputc('E', fpout);
+
+ if (ferror(fpout))
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not write temporary statistics file \"%s\": %m",
+ tmpfile)));
+ FreeFile(fpout);
+ unlink(tmpfile);
+ }
+ else if (FreeFile(fpout) < 0)
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not close temporary statistics file \"%s\": %m",
+ tmpfile)));
+ unlink(tmpfile);
+ }
+ else if (rename(tmpfile, statfile) < 0)
+ {
+ ereport(LOG,
+ (errcode_for_file_access(),
+ errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
+ tmpfile, statfile)));
+ unlink(tmpfile);
}
if (permanent)
- unlink(pgstat_stat_filename);
-}
+ {
+ get_dbstat_filename(false, false, dbid, tmpfile, MAXPGPATH);
+ elog(DEBUG1, "removing temporary stat file '%s'", tmpfile);
+ unlink(tmpfile);
+ }
+}
/* ----------
* pgstat_read_statsfile() -
*
* Reads in an existing statistics collector file and initializes the
- * databases' hash table (whose entries point to the tables' hash tables).
+ * databases' hash table. If the permanent file name is requested, also
+ * remove it after reading.
+ *
+ * If a deep read is requested, table/function stats are read also, otherwise
+ * the table/function hash tables remain empty.
* ----------
*/
static HTAB *
-pgstat_read_statsfile(Oid onlydb, bool permanent)
+pgstat_read_statsfile(Oid onlydb, bool permanent, bool deep)
{
PgStat_StatDBEntry *dbentry;
PgStat_StatDBEntry dbbuf;
- PgStat_StatTabEntry *tabentry;
- PgStat_StatTabEntry tabbuf;
- PgStat_StatFuncEntry funcbuf;
- PgStat_StatFuncEntry *funcentry;
HASHCTL hash_ctl;
HTAB *dbhash;
- HTAB *tabhash = NULL;
- HTAB *funchash = NULL;
FILE *fpin;
int32 format_id;
bool found;
@@ -3662,8 +3807,8 @@ pgstat_read_statsfile(Oid onlydb, bool permanent)
/*
* Verify it's of the expected format.
*/
- if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id)
- || format_id != PGSTAT_FILE_FORMAT_ID)
+ if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
+ format_id != PGSTAT_FILE_FORMAT_ID)
{
ereport(pgStatRunningInCollector ? LOG : WARNING,
(errmsg("corrupted statistics file \"%s\"", statfile)));
@@ -3690,8 +3835,7 @@ pgstat_read_statsfile(Oid onlydb, bool permanent)
{
/*
* 'D' A PgStat_StatDBEntry struct describing a database
- * follows. Subsequently, zero to many 'T' and 'F' entries
- * will follow until a 'd' is encountered.
+ * follows.
*/
case 'D':
if (fread(&dbbuf, 1, offsetof(PgStat_StatDBEntry, tables),
@@ -3753,21 +3897,106 @@ pgstat_read_statsfile(Oid onlydb, bool permanent)
HASH_ELEM | HASH_FUNCTION | HASH_CONTEXT);
/*
- * Arrange that following records add entries to this
- * database's hash tables.
+ * If requested, read the data from the database-specific file.
+ * If there was onlydb specified (!= InvalidOid), we would not
+ * get here because of a break above. So we don't need to
+ * recheck.
*/
- tabhash = dbentry->tables;
- funchash = dbentry->functions;
- break;
+ if (deep)
+ pgstat_read_db_statsfile(dbentry->databaseid,
+ dbentry->tables,
+ dbentry->functions,
+ permanent);
- /*
- * 'd' End of this database.
- */
- case 'd':
- tabhash = NULL;
- funchash = NULL;
break;
+ case 'E':
+ goto done;
+
+ default:
+ ereport(pgStatRunningInCollector ? LOG : WARNING,
+ (errmsg("corrupted statistics file \"%s\"",
+ statfile)));
+ goto done;
+ }
+ }
+
+done:
+ FreeFile(fpin);
+
+ if (permanent)
+ {
+ /*
+ * If requested to read the permanent file, also get rid of it; the
+ * in-memory status is now authoritative, and the permanent file would
+ * be out of date in case somebody else reads it.
+ */
+ unlink(PGSTAT_STAT_PERMANENT_FILENAME);
+ }
+
+ return dbhash;
+}
+
+
+/* ----------
+ * pgstat_read_db_statsfile() -
+ *
+ * Reads in an existing statistics collector db file and initializes the
+ * tables and functions hash tables (for the database identified by Oid).
+ * ----------
+ */
+static void
+pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent)
+{
+ PgStat_StatTabEntry *tabentry;
+ PgStat_StatTabEntry tabbuf;
+ PgStat_StatFuncEntry funcbuf;
+ PgStat_StatFuncEntry *funcentry;
+ FILE *fpin;
+ int32 format_id;
+ bool found;
+ char statfile[MAXPGPATH];
+
+ get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
+
+ /*
+ * Try to open the status file. If it doesn't exist, the backends simply
+ * return zero for anything and the collector simply starts from scratch
+ * with empty counters.
+ *
+ * ENOENT is a possibility if the stats collector is not running or has
+ * not yet written the stats file the first time. Any other failure
+ * condition is suspicious.
+ */
+ if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
+ {
+ if (errno != ENOENT)
+ ereport(pgStatRunningInCollector ? LOG : WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not open statistics file \"%s\": %m",
+ statfile)));
+ return;
+ }
+
+ /*
+ * Verify it's of the expected format.
+ */
+ if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id)
+ || format_id != PGSTAT_FILE_FORMAT_ID)
+ {
+ ereport(pgStatRunningInCollector ? LOG : WARNING,
+ (errmsg("corrupted statistics file \"%s\"", statfile)));
+ goto done;
+ }
+
+ /*
+ * We found an existing collector stats file. Read it and put all the
+ * hashtable entries into place.
+ */
+ for (;;)
+ {
+ switch (fgetc(fpin))
+ {
/*
* 'T' A PgStat_StatTabEntry follows.
*/
@@ -3854,24 +4083,41 @@ done:
FreeFile(fpin);
if (permanent)
- unlink(PGSTAT_STAT_PERMANENT_FILENAME);
+ {
+ get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
- return dbhash;
+ elog(DEBUG1, "removing permanent stats file '%s'", statfile);
+ unlink(statfile);
+ }
+
+ return;
}
/* ----------
- * pgstat_read_statsfile_timestamp() -
+ * pgstat_read_db_statsfile_timestamp() -
*
- * Attempt to fetch the timestamp of an existing stats file.
+ * Attempt to determine the timestamp of the last db statfile write.
* Returns TRUE if successful (timestamp is stored at *ts).
+ *
+ * This needs to be careful about handling databases without statfiles,
+ * i.e. databases without stat entry or not yet written. The
+ *
+ * - if there's a db stat entry, return the corresponding stats_timestamp
+ * (which may be 0 if it was not yet written, which results in writing it)
+ *
+ * - if there's no db stat entry (e.g. for a new or inactive database), there's
+ * no stat_timestamp but also nothing to write so we return timestamp of the
+ * global statfile
* ----------
*/
static bool
-pgstat_read_statsfile_timestamp(bool permanent, TimestampTz *ts)
+pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent, TimestampTz *ts)
{
+ PgStat_StatDBEntry dbentry;
PgStat_GlobalStats myGlobalStats;
FILE *fpin;
int32 format_id;
+
const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
/*
@@ -3911,12 +4157,58 @@ pgstat_read_statsfile_timestamp(bool permanent, TimestampTz *ts)
return false;
}
+ /* By default, we're going to return the timestamp of the global file. */
*ts = myGlobalStats.stats_timestamp;
+ /*
+ * We found an existing collector stats file. Read it and look for a record
+ * for the database with OID = databaseid; if found, use its timestamp.
+ */
+ for (;;)
+ {
+ switch (fgetc(fpin))
+ {
+ /*
+ * 'D' A PgStat_StatDBEntry struct describing a database
+ * follows.
+ */
+ case 'D':
+
+ if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
+ fpin) != offsetof(PgStat_StatDBEntry, tables))
+ {
+ ereport(pgStatRunningInCollector ? LOG : WARNING,
+ (errmsg("corrupted statistics file \"%s\"",
+ statfile)));
+ goto done;
+ }
+
+ /* Is this the DB we're looking for? */
+ if (dbentry.databaseid == databaseid) {
+ *ts = dbentry.stats_timestamp;
+ goto done;
+ }
+
+ break;
+
+ case 'E':
+ goto done;
+
+ default:
+ ereport(pgStatRunningInCollector ? LOG : WARNING,
+ (errmsg("corrupted statistics file \"%s\"",
+ statfile)));
+ goto done;
+ }
+ }
+
+
+done:
FreeFile(fpin);
return true;
}
+
/*
* If not already done, read the statistics collector stats file into
* some hash tables. The results will be kept until pgstat_clear_snapshot()
@@ -3947,7 +4239,19 @@ backend_read_statsfile(void)
CHECK_FOR_INTERRUPTS();
- ok = pgstat_read_statsfile_timestamp(false, &file_ts);
+ ok = pgstat_read_db_statsfile_timestamp(MyDatabaseId, false, &file_ts);
+
+ if (!ok)
+ {
+ /*
+ * see if the global file exists; if it does, then failure to read
+ * the db-specific file only means that there's no entry in the
+ * collector for it. If so, break out of here, because the file is
+ * not going to magically show up.
+ */
+
+
+ }
cur_ts = GetCurrentTimestamp();
/* Calculate min acceptable timestamp, if we didn't already */
@@ -4006,7 +4310,7 @@ backend_read_statsfile(void)
pfree(mytime);
}
- pgstat_send_inquiry(cur_ts, min_ts);
+ pgstat_send_inquiry(cur_ts, min_ts, MyDatabaseId);
break;
}
@@ -4016,7 +4320,7 @@ backend_read_statsfile(void)
/* Not there or too old, so kick the collector and wait a bit */
if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
- pgstat_send_inquiry(cur_ts, min_ts);
+ pgstat_send_inquiry(cur_ts, min_ts, MyDatabaseId);
pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
}
@@ -4024,11 +4328,14 @@ backend_read_statsfile(void)
if (count >= PGSTAT_POLL_LOOP_COUNT)
elog(WARNING, "pgstat wait timeout");
- /* Autovacuum launcher wants stats about all databases */
+ /*
+ * Autovacuum launcher wants stats about all databases, but a shallow
+ * read is sufficient.
+ */
if (IsAutoVacuumLauncherProcess())
- pgStatDBHash = pgstat_read_statsfile(InvalidOid, false);
+ pgStatDBHash = pgstat_read_statsfile(InvalidOid, false, false);
else
- pgStatDBHash = pgstat_read_statsfile(MyDatabaseId, false);
+ pgStatDBHash = pgstat_read_statsfile(MyDatabaseId, false, true);
}
@@ -4084,26 +4391,53 @@ pgstat_clear_snapshot(void)
static void
pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
{
+ slist_iter iter;
+ bool found = false;
+ DBWriteRequest *newreq;
+ PgStat_StatDBEntry *dbentry;
+
+ elog(DEBUG1, "received inquiry for %d", msg->databaseid);
+
/*
- * Advance last_statrequest if this requestor has a newer cutoff time
- * than any previous request.
+ * Find the last write request for this DB (found=true in that case). Plain
+ * linear search, not really worth doing any magic here (probably).
*/
- if (msg->cutoff_time > last_statrequest)
- last_statrequest = msg->cutoff_time;
+ slist_foreach(iter, &last_statrequests)
+ {
+ DBWriteRequest *req = slist_container(DBWriteRequest, next, iter.cur);
+
+ if (req->databaseid != msg->databaseid)
+ continue;
+
+ if (msg->cutoff_time > req->request_time)
+ req->request_time = msg->cutoff_time;
+ found = true;
+ return;
+ }
/*
- * If the requestor's local clock time is older than last_statwrite, we
+ * There's no request for this DB yet, so create one.
+ */
+ newreq = palloc(sizeof(DBWriteRequest));
+
+ newreq->databaseid = msg->databaseid;
+ newreq->request_time = msg->clock_time;
+ slist_push_head(&last_statrequests, &newreq->next);
+
+ /*
+ * If the requestor's local clock time is older than stats_timestamp, we
* should suspect a clock glitch, ie system time going backwards; though
* the more likely explanation is just delayed message receipt. It is
* worth expending a GetCurrentTimestamp call to be sure, since a large
* retreat in the system clock reading could otherwise cause us to neglect
* to update the stats file for a long time.
*/
- if (msg->clock_time < last_statwrite)
+ dbentry = pgstat_get_db_entry(msg->databaseid, false);
+ if ((dbentry != NULL) && (msg->clock_time < dbentry->stats_timestamp))
{
TimestampTz cur_ts = GetCurrentTimestamp();
- if (cur_ts < last_statwrite)
+ if (cur_ts < dbentry->stats_timestamp)
{
/*
* Sure enough, time went backwards. Force a new stats file write
@@ -4113,15 +4447,16 @@ pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
char *mytime;
/* Copy because timestamptz_to_str returns a static buffer */
- writetime = pstrdup(timestamptz_to_str(last_statwrite));
+ writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
mytime = pstrdup(timestamptz_to_str(cur_ts));
- elog(LOG, "last_statwrite %s is later than collector's time %s",
- writetime, mytime);
+ elog(LOG,
+ "stats_timestamp %s is later than collector's time %s for db %d",
+ writetime, mytime, dbentry->databaseid);
pfree(writetime);
pfree(mytime);
- last_statrequest = cur_ts;
- last_statwrite = last_statrequest - 1;
+ newreq->request_time = cur_ts;
+ dbentry->stats_timestamp = cur_ts - 1;
}
}
}
@@ -4270,29 +4605,36 @@ pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
static void
pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
{
+ Oid dbid = msg->m_databaseid;
PgStat_StatDBEntry *dbentry;
/*
* Lookup the database in the hashtable.
*/
- dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
+ dbentry = pgstat_get_db_entry(dbid, false);
/*
- * If found, remove it.
+ * If found, remove it (along with the db statfile).
*/
if (dbentry)
{
+ char statfile[MAXPGPATH];
+
+ get_dbstat_filename(true, false, dbid, statfile, MAXPGPATH);
+
+ elog(DEBUG1, "removing %s", statfile);
+ unlink(statfile);
+
if (dbentry->tables != NULL)
hash_destroy(dbentry->tables);
if (dbentry->functions != NULL)
hash_destroy(dbentry->functions);
if (hash_search(pgStatDBHash,
- (void *) &(dbentry->databaseid),
+ (void *) &dbid,
HASH_REMOVE, NULL) == NULL)
ereport(ERROR,
- (errmsg("database hash table corrupted "
- "during cleanup --- abort")));
+ (errmsg("database hash table corrupted during cleanup --- abort")));
}
}
@@ -4687,3 +5029,43 @@ pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
HASH_REMOVE, NULL);
}
}
+
+/* ----------
+ * pgstat_write_statsfile_needed() -
+ *
+ * Do we need to write out the files?
+ * ----------
+ */
+static bool
+pgstat_write_statsfile_needed(void)
+{
+ if (!slist_is_empty(&last_statrequests))
+ return true;
+
+ /* Everything was written recently */
+ return false;
+}
+
+/* ----------
+ * pgstat_db_requested() -
+ *
+ * Checks whether stats for a particular DB need to be written to a file.
+ * ----------
+ */
+
+static bool
+pgstat_db_requested(Oid databaseid)
+{
+ slist_iter iter;
+
+ /* Check the databases if they need to refresh the stats. */
+ slist_foreach(iter, &last_statrequests)
+ {
+ DBWriteRequest *req = slist_container(DBWriteRequest, next, iter.cur);
+
+ if (req->databaseid == databaseid)
+ return true;
+ }
+
+ return false;
+}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 6128694..0a53bb7 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -8704,14 +8704,25 @@ static void
assign_pgstat_temp_directory(const char *newval, void *extra)
{
/* check_canonical_path already canonicalized newval for us */
+ char *dname;
char *tname;
char *fname;
- tname = guc_malloc(ERROR, strlen(newval) + 12); /* /pgstat.tmp */
- sprintf(tname, "%s/pgstat.tmp", newval);
- fname = guc_malloc(ERROR, strlen(newval) + 13); /* /pgstat.stat */
- sprintf(fname, "%s/pgstat.stat", newval);
-
+ /* directory */
+ dname = guc_malloc(ERROR, strlen(newval) + 1); /* runtime dir */
+ sprintf(dname, "%s", newval);
+
+ /* global stats */
+ tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
+ sprintf(tname, "%s/global.tmp", newval);
+ fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
+ sprintf(fname, "%s/global.stat", newval);
+
+ if (pgstat_stat_directory)
+ free(pgstat_stat_directory);
+ pgstat_stat_directory = dname;
+ /* invalidate cached length in pgstat.c */
+ pgstat_stat_dbfile_maxlen = 0;
if (pgstat_stat_tmpname)
free(pgstat_stat_tmpname);
pgstat_stat_tmpname = tname;
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index b8faf9c..b501132 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -192,6 +192,7 @@ const char *subdirs[] = {
"base",
"base/1",
"pg_tblspc",
+ "pg_stat",
"pg_stat_tmp"
};
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 03c0174..1248f47 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -205,6 +205,7 @@ typedef struct PgStat_MsgInquiry
PgStat_MsgHdr m_hdr;
TimestampTz clock_time; /* observed local clock time */
TimestampTz cutoff_time; /* minimum acceptable file timestamp */
+ Oid databaseid; /* requested DB (InvalidOid => all DBs) */
} PgStat_MsgInquiry;
@@ -514,7 +515,7 @@ typedef union PgStat_Msg
* ------------------------------------------------------------
*/
-#define PGSTAT_FILE_FORMAT_ID 0x01A5BC9A
+#define PGSTAT_FILE_FORMAT_ID 0xA240CA47
/* ----------
* PgStat_StatDBEntry The collector's data per database
@@ -545,6 +546,7 @@ typedef struct PgStat_StatDBEntry
PgStat_Counter n_block_write_time;
TimestampTz stat_reset_timestamp;
+ TimestampTz stats_timestamp; /* time of db stats file update */
/*
* tables and functions must be last in the struct, because we don't write
@@ -722,6 +724,8 @@ extern bool pgstat_track_activities;
extern bool pgstat_track_counts;
extern int pgstat_track_functions;
extern PGDLLIMPORT int pgstat_track_activity_query_size;
+extern char *pgstat_stat_directory;
+extern int pgstat_stat_dbfile_maxlen;
extern char *pgstat_stat_tmpname;
extern char *pgstat_stat_filename;
Hi,
just a few charts from our production systems, illustrating the impact of
this patch. We deployed it on January 12 on our systems running 9.1
(i.e. a backpatch), so we have enough data to produce some nice charts.
I don't expect the changes made to the patch since then to affect the
impact significantly.
I've chosen two systems with large numbers of databases (over 1000 on
each); each database contains multiple tables (possibly hundreds or more).
The "cpu" first charts show quantiles of CPU usage, and assuming that
the system usage did not change, there's a clear drop by about 15%
percent in both cases. This is a 8-core system (on AWS), so this means a
save of about 120% of one core.
The "disk usage" charts show how much space was needed for the stats
(placed on tmpfs filesystem, mounted at /mnt/pg_tmp). The filesystem max
size is 400MB and the files require ~100MB. With the unpatched code the
space required was actually ~200MB because of the copying.
regards
Tomas
Attachments:
system-a-cpu.png (image/png) — CPU usage quantiles on system A [binary image data omitted]