Re: ERROR: could not read block

Started by Magnus Hagander about 20 years ago, 12 messages
#1 Magnus Hagander
mha@sollentuna.net

[copying this one over to hackers]

> Our DBAs reviewed the Microsoft documentation you referenced,
> modified the registry, and rebooted the OS. We've been
> beating up on the database without seeing the error so far.
> We'll keep at it for a while.

Very interesting. As this seems to be a resource error, I have a couple of
questions. Sorry if you've already answered some of them; I couldn't find
the answers in the archives.

1) Is this a dedicated pg server, or does it have something else on it?

2) We have to ask this - do you run any antivirus on it that might not
be releasing resources the right way? Anything else that might stick in
a kernel driver?

3) Are you hitting the database with many connections, or is this a
single/few connection scenario? Are the other connections typically
active when this shows up?

Seems like we could just retry when we get this failure. The question is
whether we need to sleep for a short time before we do. Also, we can't just
retry forever; there has to be some kind of end to it...
(If you read the SQL KB article, it can be read as saying that retrying is
the correct thing, because the bug in SQL was that it didn't retry.)

//Magnus
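
A minimal sketch of the bounded retry raised above, written in C. It only
illustrates the shape of the idea: retry a transient failure a limited
number of times, pausing between attempts, and give up rather than loop
forever. The callback types, the result enum, and the simulated
flaky_reader are hypothetical names introduced for this example; none of
them come from the PostgreSQL source or from any patch in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callbacks standing in for the real read call and a sleep. */
typedef bool (*try_read_fn) (void *ctx, bool *transient);
typedef void (*sleep_ms_fn) (unsigned ms);

typedef enum { READ_OK, READ_FAILED, READ_GAVE_UP } read_result;

/*
 * Call try_read up to max_attempts times.  Retry only when the failure is
 * flagged as transient, and pause delay_ms between attempts so that other
 * I/O has a chance to complete.  The loop always terminates.
 */
read_result
read_with_retry(try_read_fn try_read, void *ctx, sleep_ms_fn sleep_ms,
                unsigned max_attempts, unsigned delay_ms,
                unsigned *attempts_used)
{
    assert(try_read != NULL && max_attempts >= 1);

    for (unsigned attempt = 1; attempt <= max_attempts; attempt++)
    {
        bool transient = false;

        if (attempts_used != NULL)
            *attempts_used = attempt;
        if (try_read(ctx, &transient))
            return READ_OK;
        if (!transient)
            return READ_FAILED;     /* hard error: retrying will not help */
        if (attempt < max_attempts && sleep_ms != NULL)
            sleep_ms(delay_ms);     /* back off before the next attempt */
    }
    return READ_GAVE_UP;            /* transient failure outlasted the budget */
}

/* Simulated reader for illustration: fails transiently N times, then works. */
struct flaky_reader { int failures_left; };

bool
flaky_read(void *ctx, bool *transient)
{
    struct flaky_reader *r = ctx;

    if (r->failures_left > 0)
    {
        r->failures_left--;
        *transient = true;
        return false;
    }
    return true;
}
```

With a reader that fails twice and a budget of five attempts, the call
succeeds on the third attempt; with a budget of three and a reader that
keeps failing, it returns READ_GAVE_UP. The right attempt count and delay
are exactly the open questions in this thread, so both are left as
parameters here.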

#2 Qingqing Zhou
zhouqq@cs.toronto.edu
In reply to: Magnus Hagander (#1)
Re: [ADMIN] ERROR: could not read block

"Magnus Hagander" <mha@sollentuna.net> wrote

> Seems like we could just retry when we get this failure. The question is
> whether we need to sleep for a short time before we do. Also, we can't just
> retry forever; there has to be some kind of end to it...
> (If you read the SQL KB article, it can be read as saying that retrying is
> the correct thing, because the bug in SQL was that it didn't retry.)

I agree with the retry solution. Yes, the two important factors are the
interval between retries and the number of retries. I suspect that on a
dedicated server, several retries can handle it. But for a server that might
be running a backup at the same time, who knows how long we need. Either
way, I don't think an endless loop is needed -- at most 3 minutes (since
s_lock() does this :-)).

Also, this is only a partial solution to the "invalid parameter" win32 IO
problem. There are some other cases, like the ACCESS_VIOLATION error, that
need more evidence to pin down.

Regards,
Qingqing

P.S. Going to be out of town for several days ...

#3 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Qingqing Zhou (#2)
Re: [HACKERS] ERROR: could not read block

1) We run a couple Java applications on the same box to provide
middle tier access. When the box is heavily loaded, I think I've
seen about 80% PostgreSQL, 20% Java load.

2) I checked that no antivirus software was running, and had the
techs pare down the services running on that box to the absolute
minimum after the second failure, so that we could eliminate such
issues as possible causes.

3) The aforementioned Java apps hold open 21 database
connections. (One for a software publisher to query a list of jar
files for access to the database, and 20 for a connection pool in
the middle tier.) The way the pool is configured, six of those are
used for queries of normal priority, so we rarely have more than
six connections doing anything at any one moment. During the
initial failure, the middle tier was under normal load, so 45,000
inserts were made to the table in question during the update.
After we hit the problem, we removed that middle tier from the
list of targets, so it was running, but totally idle during the
remaining tests.

None of this seems material, however. It's pretty clear that the
problem was exhaustion of the Windows page pool. Our Windows
experts have reconfigured the machine (which had been tuned
for Sybase ASE). Their changes have boosted the page pool
from 20,000 entries to 180,000 entries. We're continuing to test
to ensure that the problem is not showing up with this
configuration; but, so far, it looks good.

If we don't want to tell Windows users to make highly technical
changes to the Windows registry in order to use PostgreSQL,
it does seem wise to use retries, as has already been discussed
on this thread.

-Kevin

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kevin Grittner (#3)
Re: [HACKERS] ERROR: could not read block

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:

> None of this seems material, however. It's pretty clear that the
> problem was exhaustion of the Windows page pool.
> ...
> If we don't want to tell Windows users to make highly technical
> changes to the Windows registry in order to use PostgreSQL,
> it does seem wise to use retries, as has already been discussed
> on this thread.

Would a simple retry loop actually help? It's not clear to me how
persistent such a failure would be.

regards, tom lane

#5 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Tom Lane (#4)
Re: [HACKERS] ERROR: could not read block

I'm not an expert on that, but it seems reasonable to me that the
page pool would free space as the I/O system caught up with
the load. Also, I'm going on what was said by Qingqing and
in one of the pages he referenced:

http://support.microsoft.com/default.aspx?scid=kb;en-us;274310

-Kevin

#6 Magnus Hagander
mha@sollentuna.net
In reply to: Kevin Grittner (#5)
Re: [HACKERS] ERROR: could not read block

> Tom Lane <tgl@sss.pgh.pa.us> >>>
>
> > "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> >
> > None of this seems material, however. It's pretty clear that the
> > problem was exhaustion of the Windows page pool.
> > ...
> > If we don't want to tell Windows users to make highly technical
> > changes to the Windows registry in order to use PostgreSQL, it does
> > seem wise to use retries, as has already been discussed on this
> > thread.
>
> Would a simple retry loop actually help? It's not clear to
> me how persistent such a failure would be.

(Not sure why I didn't get Tom's mail - lists acting up again? Anyway, I
got Kevin's response, but am responding primarily to Tom.)

The way I read it, a delay should help. It's basically running out of
kernel buffers, and if we just delay, somebody else (another process, or an
IRQ handler, or whatever) should get finished with their I/O, free up
the buffer, and let us have it. Looking around a bit, I see several
references saying that you should retry on it, but nothing in the API docs.
I do think it's probably a good idea to do a short delay before retrying
- at least to yield the CPU for one slice. That would greatly increase
the probability of someone else finishing their I/O...

That's how I read it, but I'm not 100% sure.

//Magnus

#7 Magnus Hagander
mha@sollentuna.net
In reply to: Magnus Hagander (#6)
Re: [HACKERS] ERROR: could not read block

> None of this seems material, however. It's pretty clear that
> the problem was exhaustion of the Windows page pool. Our
> Windows experts have reconfigured the machine (which had been
> tuned for Sybase ASE). Their changes have boosted the page
> pool from 20,000 entries to 180,000 entries. We're
> continuing to test to ensure that the problem is not showing
> up with this configuration; but, so far, it looks good.

Nope, with these numbers it doesn't. I was looking for a reason as to
why it would exhaust the pool - such as a huge number of connections.
Which doesn't appear to be so :-(

Another thing that will affect this is if you have a lot of network
sockets open. Anything like that?

BTW, do you get any event ID 2020 entries in your event log?

> If we don't want to tell Windows users to make highly
> technical changes to the Windows registry in order to use
> PostgreSQL, it does seem wise to use retries, as has already
> been discussed on this thread.

Yeah, I think it's at least worth a try at that.

//Magnus

#8 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Magnus Hagander (#7)
Re: [HACKERS] ERROR: could not read block

There weren't a large number of connections -- it seemed to be
that the one big update query, by itself, would do this. It seemed
to get through a lot of rows before failing. This table is normally
"insert only" -- so it would likely be getting most or all of the space
for inserting the updated rows from extending the table. Also, the
only reasonable plan for this update would be a table scan, so
it is possible that the failure occurred some time after the scan got
to rows added by the update statement.

It appears that the techs cleared the eventlog when they
reconfigured the machine, so I can no longer check for events
from the failures. :-(

-Kevin

#9 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Kevin Grittner (#8)
Re: [HACKERS] ERROR: could not read block

A couple clarifications:

There were only a few network sockets open.

I'm told that the eventlog was reviewed for any events which
might be related to the failures before it was cleared. They
found none, so that makes it fairly certain there was no 2020
event.

-Kevin

#10 Qingqing Zhou
zhouqq@cs.toronto.edu
In reply to: Kevin Grittner (#3)
Re: [ADMIN] ERROR: could not read block

"Tom Lane" <tgl@sss.pgh.pa.us> wrote

> Would a simple retry loop actually help? It's not clear to me how
> persistent such a failure would be.

[in reply to all the follow-ups] Yeah, this is the key, and we definitely
have no 100% guarantee that several retries will solve the problem - just as
with the situation in pg_unlink/pg_rename. But shall we do something now? If
Kevin could help with testing (you may have to revert the registry changes :-( ),
I would like to send a patch in the retry style.

Regards,
Qingqing

#11 Jim C. Nasby
jnasby@pervasive.com
In reply to: Magnus Hagander (#6)
Re: [HACKERS] ERROR: could not read block

On Thu, Nov 17, 2005 at 07:56:21PM +0100, Magnus Hagander wrote:

> The way I read it, a delay should help. It's basically running out of
> kernel buffers, and if we just delay, somebody else (another process, or an
> IRQ handler, or whatever) should get finished with their I/O, free up
> the buffer, and let us have it. Looking around a bit, I see several
> references saying that you should retry on it, but nothing in the API docs.
> I do think it's probably a good idea to do a short delay before retrying
> - at least to yield the CPU for one slice. That would greatly increase
> the probability of someone else finishing their I/O...

If that makes it into code, ISTM it would be good if it also threw a
NOTICE so that users could see if this was happening; kinda like the
notice about log files being recycled frequently.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
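
A small sketch of the reporting idea above, in C: produce a message only
when at least one retry was actually needed, so that a quiet system stays
quiet. The function name, its parameters, and the message wording are
hypothetical and chosen only for illustration. In the server itself this
would presumably go through ereport(NOTICE, ...) rather than a formatted
buffer.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: format a notice describing how many retries an
 * operation needed.  Returns 0 and leaves dst empty when no retry
 * happened; otherwise returns the value of snprintf().
 */
int
format_retry_notice(char *dst, size_t dstlen, const char *operation,
                    int retries)
{
    assert(dst != NULL && dstlen > 0 && operation != NULL);

    dst[0] = '\0';
    if (retries <= 0)
        return 0;               /* nothing unusual happened: stay silent */

    return snprintf(dst, dstlen, "NOTICE:  %s succeeded after %d retr%s",
                    operation, retries, retries == 1 ? "y" : "ies");
}
```

A caller would pass the retry count left over from its retry loop and
print or log the buffer only when the return value is positive.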

#12 Qingqing Zhou
zhouqq@cs.toronto.edu
In reply to: Magnus Hagander (#6)
Re: [ADMIN] ERROR: could not read block

"Magnus Hagander" <mha@sollentuna.net> wrote

> The way I read it, a delay should help. It's basically running out of
> kernel buffers, and if we just delay, somebody else (another process, or an
> IRQ handler, or whatever) should get finished with their I/O, free up
> the buffer, and let us have it. Looking around a bit, I see several
> references saying that you should retry on it, but nothing in the API docs.
> I do think it's probably a good idea to do a short delay before retrying
> - at least to yield the CPU for one slice. That would greatly increase
> the probability of someone else finishing their I/O...

Here is more that I read in the second thread:

" NTBackupread and NTBackupwrite both use buffered I/O. This means that
Windows NT caches the I/O that is performed against the stream. It is also
the only API that will back up the metadata of a file. This cache is pulled
from limited resources: namely, pool and nonpaged pool. Because of this,
extremely large numbers of files or files that are very large may cause the
pool resources to run low. "

So does this imply that using unbuffered I/O on Windows would eliminate
the problem? If so, just adding FILE_FLAG_NO_BUFFERING when we open a data
file would solve it -- but this change is in fact very invasive, because it
would make the server's I/O optimization strategy totally different from
*nix.

Regards,
Qingqing
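
One concrete reason the FILE_FLAG_NO_BUFFERING route is invasive: files
opened with that flag must be read and written at offsets and in sizes
that are multiples of the volume sector size, using buffers aligned to
the sector size. The sketch below, in C, just checks a request against
those three constraints. The function name and parameters are
hypothetical, and the sector size is passed in by the caller rather than
queried from the volume.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical check: would this request satisfy the alignment rules that
 * unbuffered file I/O imposes?  All three of buffer address, file offset,
 * and length must be multiples of the sector size.
 */
bool
unbuffered_request_ok(const void *buf, uint64_t offset, size_t len,
                      size_t sector_size)
{
    assert(sector_size > 0);

    return ((uintptr_t) buf % sector_size == 0) &&
           (offset % sector_size == 0) &&
           (len % sector_size == 0);
}
```

An 8192-byte block read at a block-aligned offset passes this check for
512-byte sectors as long as the buffer itself is suitably aligned, so
alignment is probably the smaller part of the cost. The larger part is
the one already noted above: bypassing the OS file cache changes the
server's I/O behaviour relative to *nix.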