buildfarm logging versus embedded nulls

Started by Tom Lane almost 16 years ago; 15 messages
#1 Tom Lane
tgl@sss.pgh.pa.us

I was looking at this recent nonrepeatable buildfarm failure:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=polecat&dt=2010-03-11%2021:45:10
which has several instances of the known "pgstat wait timeout" problem
during the parallel regression tests.

I was disappointed to see that the postmaster log part of the report
is truncated near the start of the run, so there's no way to see if
anything interesting got logged near the point of the failure.

When I run "make check" on my own OS X machine, I notice that the
postmaster.log file usually has some runs of a few dozen null bytes in
it. I suspect this is an artifact of Apple's stdio buffering
implementation when several backends are writing to the same log file.
I suppose that what happened in the above case is that some nulls got
embedded in postmaster.log, and then the file got truncated at the first
null, perhaps during the upload to the buildfarm server, or maybe it's
intact on the server but the web page is failing to display anything
past that point.

There's probably not much we can do about Apple's stdio (and I would bet
that they inherited this behavior from the BSDen anyway). What we
*could* do is

(1) encourage buildfarm owners to run the tests with logging_collector
turned on, and/or

(2) fix the buildfarm reporting mechanisms to not be fazed by nulls in
the log files.

I have no clear idea how hard either of these things is.

regards, tom lane

#2 Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#1)
Re: buildfarm logging versus embedded nulls

Tom Lane wrote:

I was looking at this recent nonrepeatable buildfarm failure:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=polecat&dt=2010-03-11%2021:45:10
which has several instances of the known "pgstat wait timeout" problem
during the parallel regression tests.

I was disappointed to see that the postmaster log part of the report
is truncated near the start of the run, so there's no way to see if
anything interesting got logged near the point of the failure.

When I run "make check" on my own OS X machine, I notice that the
postmaster.log file usually has some runs of a few dozen null bytes in
it. I suspect this is an artifact of Apple's stdio buffering
implementation when several backends are writing to the same log file.
I suppose that what happened in the above case is that some nulls got
embedded in postmaster.log, and then the file got truncated at the first
null, perhaps during the upload to the buildfarm server, or maybe it's
intact on the server but the web page is failing to display anything
past that point.

There's probably not much we can do about Apple's stdio (and I would bet
that they inherited this behavior from the BSDen anyway). What we
*could* do is

(1) encourage buildfarm owners to run the tests with logging_collector
turned on, and/or

(2) fix the buildfarm reporting mechanisms to not be fazed by nulls in
the log files.

I have no clear idea how hard either of these things is.

Well, the good news is that we actually have the data on the server, in
a temp file that will be cleaned up, but hasn't been yet. I have placed
a copy at <http://buildfarm.postgresql.org/polecat-check.log.gz>. And
thus we know that the client does exactly what you want, and the problem
is entirely on the server. That's more good news.

Now, the log_text field in our build_status_log table is text, so it's
on insertion to the database that it gets truncated. I'm thinking that I
should just escape nulls with a regex ( 's/\x00/\\0/g' or similar)
before inserting the data.

Anyone got any better ideas?

(BTW, your idea of using logging_collector won't work without
significant changes in the buildfarm client. Nice idea, though.)

cheers

andrew

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#2)
Re: buildfarm logging versus embedded nulls

Andrew Dunstan <andrew@dunslane.net> writes:

Tom Lane wrote:

I was looking at this recent nonrepeatable buildfarm failure:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=polecat&dt=2010-03-11%2021:45:10

Well, the good news is that we actually have the data on the server, in
a temp file that will be cleaned up, but hasn't been yet. I have placed
a copy at <http://buildfarm.postgresql.org/polecat-check.log.gz>.

Thanks for that. I found the relevant part of the postmaster log:

[4b99671f.ddb4:36] LOG: statement: DELETE FROM vactst WHERE i != 0;
[4b99671f.ddb4:37] LOG: statement: VACUUM (FULL) vactst;
[4b99671f.ddb4:38] WARNING: pgstat wait timeout
[4b99671f.ddb4:39] LOG: statement: DELETE FROM vactst;
[4b99671f.ddb4:40] LOG: statement: SELECT * FROM vactst;
[4b99671f.ddb4:41] LOG: statement: VACUUM (FULL, FREEZE) vactst;
[4b99671f.ddb4:42] WARNING: pgstat wait timeout
[4b99671f.ddb4:43] LOG: statement: VACUUM (ANALYZE, FULL) vactst;
[4b99671f.ddb4:44] WARNING: pgstat wait timeout
[4b99671f.ddb4:45] LOG: statement: CREATE TABLE vaccluster (i INT PRIMARY KEY);
[4b99671f.ddb4:46] LOG: statement: ALTER TABLE vaccluster CLUSTER ON vaccluster_pkey;
[4b99671f.ddb4:47] LOG: statement: INSERT INTO vaccluster SELECT * FROM vactst;
[4b99671f.ddb4:48] LOG: statement: CLUSTER vaccluster;
[4b99671f.ddb4:49] LOG: statement: VACUUM FULL pg_am;
[4b99671f.ddb4:50] WARNING: pgstat wait timeout
[4b99671f.ddb4:51] LOG: statement: VACUUM FULL pg_class;
[4b99671f.ddb4:52] WARNING: pgstat wait timeout
[4b99671f.ddb4:53] LOG: statement: VACUUM FULL pg_database;
[4b99671f.ddb4:54] WARNING: pgstat wait timeout
[4b99671f.ddb4:55] LOG: statement: VACUUM FULL vaccluster;
[4b996707.dcc0:2] WARNING: pgstat wait timeout
[4b99671f.ddb4:56] WARNING: pgstat wait timeout
[4b99671f.ddb4:57] LOG: statement: VACUUM FULL vactst;
[4b996707.dcc0:3] WARNING: pgstat wait timeout
[4b996747.ddcc:1] WARNING: pgstat wait timeout
[4b99671f.ddba:4] LOG: statement: INSERT INTO DEFAULT_TBL VALUES (1, 'thomas', 57.0613);

Most of the "pgstat wait timeout" gripes are coming from the backend
running the "vacuum" regression test, but there are two that came from
process dcc0, which is shown by other log entries to be the autovacuum
launcher. So now I'm wondering if there could be some issue that occurs
when the autovac launcher and a backend both want stats concurrently.
I have not got further than that in my uncaffeinated state --- anybody
have any ideas?

Now, the log_text field in our build_status_log table is text, so it's
on insertion to the database that it gets truncated. I'm thinking that I
should just escape nulls with a regex ( 's/\x00/\\0/g' or similar)
before inserting the data.

Works for me.

regards, tom lane

#4 Alvaro Herrera
alvherre@commandprompt.com
In reply to: Tom Lane (#3)
Re: buildfarm logging versus embedded nulls

Tom Lane wrote:

Most of the "pgstat wait timeout" gripes are coming from the backend
running the "vacuum" regression test, but there are two that came from
process dcc0, which is shown by other log entries to be the autovacuum
launcher. So now I'm wondering if there could be some issue that occurs
when the autovac launcher and a backend both want stats concurrently.

Seems like this issue would exist whenever two backends want stats
concurrently, no? Is this well-tested?

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#5 Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#3)
Re: buildfarm logging versus embedded nulls

Tom Lane wrote:

Now, the log_text field in our build_status_log table is text, so it's
on insertion to the database that it gets truncated. I'm thinking that I
should just escape nulls with a regex ( 's/\x00/\\0/g' or similar)
before inserting the data.

Works for me.

Done.

cheers

andrew

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#4)
Re: buildfarm logging versus embedded nulls

Alvaro Herrera <alvherre@commandprompt.com> writes:

Tom Lane wrote:

Most of the "pgstat wait timeout" gripes are coming from the backend
running the "vacuum" regression test, but there are two that came from
process dcc0, which is shown by other log entries to be the autovacuum
launcher. So now I'm wondering if there could be some issue that occurs
when the autovac launcher and a backend both want stats concurrently.

Seems like this issue would exist whenever two backends want stats
concurrently, no? Is this well-tested?

Well, you tell me how much it's been tested ;-). But the autovac
launcher could possibly interact differently than another backend,
since it has a different value for maximum stats file age.

Anyway it's only a guess. It could well be that that machine was simply
so heavily loaded that the stats collector couldn't respond fast enough.
I'm just wondering whether there's an unrecognized bug lurking here.

regards, tom lane

#7 Robert Creager
robert@logicalchaos.org
In reply to: Andrew Dunstan (#2)
Re: buildfarm logging versus embedded nulls

On Mar 11, 2010, at 6:00 PM, Andrew Dunstan wrote:

Tom Lane wrote:

I was looking at this recent nonrepeatable buildfarm failure:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=polecat&dt=2010-03-11%2021:45:10
which has several instances of the known "pgstat wait timeout" problem
during the parallel regression tests.

You've got me trained well now. I'm now looking at my build machine failures. Wasn't sure what to do about that one, since no relevant files changed.

Is there any value in setting "keep_error_builds => 0,"? I know Andrew was able to get the complete log file.

Cheers,
Rob

#8 Andrew Dunstan
andrew@dunslane.net
In reply to: Robert Creager (#7)
Re: buildfarm logging versus embedded nulls

Robert Creager wrote:

On Mar 11, 2010, at 6:00 PM, Andrew Dunstan wrote:

Tom Lane wrote:

I was looking at this recent nonrepeatable buildfarm failure:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=polecat&dt=2010-03-11%2021:45:10
which has several instances of the known "pgstat wait timeout" problem
during the parallel regression tests.

You've got me trained well now. I'm now looking at my build machine failures. Wasn't sure what to do about that one, since no relevant files changed.

Is there any value in setting "keep_error_builds => 0,"? I know Andrew was able to get the complete log file.

You normally want this as 0, to avoid eating up disk space. You
certainly don't want it non-zero for long unless you have many Gb to spare.

I doubt keeping this particular build would have helped much. The build
was probably fine; the bug is apparently triggered by some hard-to-repeat
timing condition, from what I gather from Tom's analysis.

And from now on we will not have logs truncated by the presence of nul
bytes.

cheers

andrew

#9 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Creager (#7)
Re: buildfarm logging versus embedded nulls

Robert Creager <robert@logicalchaos.org> writes:

Is there any value in setting "keep_error_builds => 0,"? I know Andrew was able to get the complete log file.

The data is all uploaded to the buildfarm server, so as long as EDB
doesn't holler uncle about the amount of storage that's taking, I don't
think you need to keep 'em locally.

regards, tom lane

#10 Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#9)
Re: buildfarm logging versus embedded nulls

Tom Lane wrote:

Robert Creager <robert@logicalchaos.org> writes:

Is there any value in setting "keep_error_builds => 0,"? I know Andrew was able to get the complete log file.

The data is all uploaded to the buildfarm server, so as long as EDB
doesn't holler uncle about the amount of storage that's taking, I don't
think you need to keep 'em locally.

Er, that's CMD who host it. Credit where credit is due ;-)

One of these days they will get upset and we'll purge the back logs so
we only keep the last 6 months or a year.

cheers

andrew

#11 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#6)
Re: buildfarm logging versus embedded nulls

I wrote:

Anyway it's only a guess. It could well be that that machine was simply
so heavily loaded that the stats collector couldn't respond fast enough.
I'm just wondering whether there's an unrecognized bug lurking here.

Still meditating on this ... and it strikes me that the pgstat.c code
is really uncommunicative about problems. In particular,
pgstat_read_statsfile_timestamp and pgstat_read_statsfile don't complain
at all about being unable to read a stats file. It seems to me that the
only "expected" case is ENOENT (and even that isn't really expected, in
normal operation). Surely we should at least elog(LOG) any other
failure condition?
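
For instance, something like this at the point where the file is opened
(a sketch only: the function name and return convention are invented for
illustration, not lifted from pgstat.c):

#include "postgres.h"           /* backend environment: elog(), PG_BINARY_R */
#include "storage/fd.h"         /* AllocateFile() */
#include <errno.h>

/*
 * Sketch: open the stats file and complain about anything other than
 * plain ENOENT, which is the only halfway-expected failure.
 */
static FILE *
open_stats_file_or_log(const char *statfile)
{
    FILE       *fp = AllocateFile(statfile, PG_BINARY_R);

    if (fp == NULL && errno != ENOENT)
        elog(LOG, "could not open statistics file \"%s\": %m", statfile);

    return fp;                  /* NULL means "fall back to default stats" */
}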

Another place that could probably do with elog(LOG) is where
pgstat_write_statsfile resets last_statrequest in case it's in the
future. That shouldn't ever happen. While the reset is probably
a good thing for robustness, wouldn't logging it be a good idea?
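
E.g. (again just a sketch; last_statrequest is real, but the surrounding
context here is guessed, not the actual collector loop):

/* Assumes the collector-side environment of the sketch above. */
if (last_statrequest > GetCurrentTimestamp())
{
    elog(LOG, "last_statrequest %s is later than collector's clock, resetting",
         timestamptz_to_str(last_statrequest));
    last_statrequest = GetCurrentTimestamp();
}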

Lastly, backend_read_statsfile is designed to send an inquiry message
every time through the loop, ie, every 10 msec. This is said to be in
case the stats collector drops one. But is this enough to flood the
collector and make things worse? I wonder if there should be some
backoff there.
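
Perhaps along these lines (a sketch: send_stats_inquiry and
stats_file_newer_than are stand-ins for the real internals, and the
constants are only illustrative):

#include "postgres.h"
#include "utils/timestamp.h"    /* TimestampTz */

/* Hypothetical stand-ins for the real pgstat.c machinery */
extern void send_stats_inquiry(TimestampTz cutoff);
extern bool stats_file_newer_than(TimestampTz cutoff);

/*
 * Sketch of a wait loop with backoff: still wake every 10 msec to check
 * for a fresh file, but only re-send the inquiry every ~160 msec in case
 * the first one was dropped, rather than on every wakeup.
 */
static void
wait_for_stats(TimestampTz cutoff)
{
    int         i;

    for (i = 0; i < 500; i++)           /* give up after ~5 seconds */
    {
        if (i % 16 == 0)
            send_stats_inquiry(cutoff);

        pg_usleep(10 * 1000L);          /* 10 msec */

        if (stats_file_newer_than(cutoff))
            return;
    }
    elog(WARNING, "pgstat wait timeout");
}

(The cutoff would still be computed once by the caller, so re-sending the
inquiry doesn't keep asking for an ever-fresher file.)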

regards, tom lane

#12 Alvaro Herrera
alvherre@commandprompt.com
In reply to: Tom Lane (#11)
Re: buildfarm logging versus embedded nulls

Tom Lane wrote:

I wrote:

Anyway it's only a guess. It could well be that that machine was simply
so heavily loaded that the stats collector couldn't respond fast enough.
I'm just wondering whether there's an unrecognized bug lurking here.

Still meditating on this ... and it strikes me that the pgstat.c code
is really uncommunicative about problems. In particular,
pgstat_read_statsfile_timestamp and pgstat_read_statsfile don't complain
at all about being unable to read a stats file.

Yeah, I had the same thought.

Lastly, backend_read_statsfile is designed to send an inquiry message
every time through the loop, ie, every 10 msec. This is said to be in
case the stats collector drops one. But is this enough to flood the
collector and make things worse? I wonder if there should be some
backoff there.

I also think the autovacuum worker's minimum timestamp may be playing
games with the retry logic. Maybe a worker is requesting a new file
continuously because pgstat is not able to provide one before the
deadline is past, and thus overloading it. I still think that a 500ms-old
file is too stale for a worker, but requiring one no more than 10ms old
seems too strict. Maybe it should just be, say, 100ms.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#13 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#12)
Re: buildfarm logging versus embedded nulls

Alvaro Herrera <alvherre@commandprompt.com> writes:

Tom Lane wrote:

Still meditating on this ... and it strikes me that the pgstat.c code
is really uncommunicative about problems. In particular,
pgstat_read_statsfile_timestamp and pgstat_read_statsfile don't complain
at all about being unable to read a stats file.

Yeah, I had the same thought.

OK, I'll add some logging.

Lastly, backend_read_statsfile is designed to send an inquiry message
every time through the loop, ie, every 10 msec. This is said to be in
case the stats collector drops one. But is this enough to flood the
collector and make things worse? I wonder if there should be some
backoff there.

I also think the autovacuum worker's minimum timestamp may be playing
games with the retry logic. Maybe a worker is requesting a new file
continuously because pgstat is not able to provide one before the
deadline is past, and thus overloading it. I still think that a 500ms-old
file is too stale for a worker, but requiring one no more than 10ms old
seems too strict. Maybe it should just be, say, 100ms.

But we don't advance the deadline within the wait loop, so (in theory)
a single requestor shouldn't be able to trigger more than one stats file
update. I wonder though if an autovac worker could make many such
requests over its lifespan ...

regards, tom lane

#14 Alvaro Herrera
alvherre@commandprompt.com
In reply to: Tom Lane (#13)
Re: buildfarm logging versus embedded nulls

Tom Lane wrote:

Alvaro Herrera <alvherre@commandprompt.com> writes:

I also think the autovacuum worker minimum timestamp may be playing
games with the retry logic too. Maybe a worker is requesting a new file
continuously because pgstat is not able to provide one before the
deadline is past, and thus overloading it. I still think that 500ms is
too much for a worker, but backing off all the way to 10ms seems too
much. Maybe it should just be, say, 100ms.

But we don't advance the deadline within the wait loop, so (in theory)
a single requestor shouldn't be able to trigger more than one stats file
update.

Hmm, yeah.

I wonder though if an autovac worker could make many such
requests over its lifespan ...

Well, yes, but it will request fresh stats only for the recheck logic
before each table, so there will be one intervening vacuum (or none,
actually, if the table was vacuumed by some other autovac worker;
though given the default naptime of 1 min, I find it unlikely that the
regression database will ever see more than one worker).

Since the warning comes from the launcher and not the worker, I wonder
if this is a red herring.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#15 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#14)
Re: buildfarm logging versus embedded nulls

Alvaro Herrera <alvherre@commandprompt.com> writes:

Since the warning comes from the launcher and not the worker, I wonder
if this is a red herring.

It's all speculation at the moment. So far there's not really enough
evidence to refute the idea that the system was just under heavy load
at that point --- except that even under heavy load it shouldn't take
the stats collector 5 seconds to write the stats file for the regression
database, ISTM.

I wonder if there is any practical way for the buildfarm client script
to report about the system's load average, or some other gauge of how
much is going on in the buildfarm machine besides the regression tests.
One thought is just to log how long it takes to run the regression
tests. A longer-than-usual run for a particular animal would be evidence
of a load spike; if we could correlate that with failures of this sort
it would be easier to write them off as heavy load.
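
For what it's worth, the load-average number itself is cheap to get on
most Unix-ish platforms. A trivial standalone C illustration (the
buildfarm client is Perl, of course, so this only shows what the OS
exposes, not how the client would report it):

#define _DEFAULT_SOURCE         /* for getloadavg() declaration on glibc */
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
    double      load[3];

    /* getloadavg(3) exists on macOS, the BSDs, and Linux; Windows
     * animals would need something else entirely. */
    if (getloadavg(load, 3) == 3)
        printf("load average: %.2f %.2f %.2f\n", load[0], load[1], load[2]);
    else
        printf("load average not available here\n");
    return 0;
}

Just recording the elapsed wall-clock time of each step might be even
simpler, since the client presumably already knows when each step starts
and ends.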

regards, tom lane