win32 _dosmaperr()

Started by Qingqing Zhou almost 21 years ago, 7 messages, hackers
#1Qingqing Zhou
zhouqq@cs.toronto.edu

There were several reports of "unable to read/write" on the Pg8.0.x win32 port:

http://archives.postgresql.org/pgsql-bugs/2005-02/msg00181.php

I encountered this several times and finally caught the GetLastError()
number. It is

32, ERROR_SHARING_VIOLATION
The process cannot access the file because it is being used by another
process.

But the PG server error message is "invalid parameter", which makes this error
difficult to understand and track. After examining the win32 CRT's _dosmaperr()
implementation, I found that it fails to translate ERROR_SHARING_VIOLATION, so
errno is set to the default, EINVAL. To solve this, we can call our own
_dosmaperr(GetLastError()) again if a read/write fails. Unfortunately our
_dosmaperr() fails to translate it either, so here is a patch to error.c. Also, I
raised the error level to NOTICE for better bug reports. If this is
acceptable, I will patch FileRead()/FileWrite() etc.

However, I am not very sure why this could happen. That is, who opens the data
file in a non-sharing mode? There are many possibilities; the common consensus
is [anti-]virus software. Yes, I do have one installed. If we can confirm
this, then we could at least print a hint message.

Regards,
Qingqing

Index: error.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/port/win32/error.c,v
retrieving revision 1.4
diff -c -r1.4 error.c
*** error.c 31 Dec 2004 22:00:37 -0000 1.4
--- error.c 13 Jul 2005 09:04:57 -0000
***************
*** 72,77 ****
--- 72,80 ----
    ERROR_NO_MORE_FILES, ENOENT
   },
   {
+   ERROR_SHARING_VIOLATION, EACCES
+  },
+  {
    ERROR_LOCK_VIOLATION, EACCES
   },
   {
***************
*** 180,188 ****
    }
   }
!  ereport(DEBUG4,
!    (errmsg_internal("Unknown win32 error code: %i",
!         (int) e)));
   errno = EINVAL;
   return;
  }
--- 183,192 ----
    }
   }

!  ereport(NOTICE,
!    (errmsg_internal("Unknown win32 error code: %i. "
!         "Please report to <pgsql-bugs@postgresql.org>.",
!         (int) e)));
   errno = EINVAL;
   return;
  }
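
For readers who want to try the lookup outside the backend, here is a minimal, portable sketch of the table-driven translation the patch extends. It is not the code in error.c: the `sk_` names are hypothetical, and the Win32 error numbers are defined locally (with the values from <winerror.h>) so that the example compiles on any platform.

```c
#include <errno.h>
#include <stddef.h>

/* Win32 error codes, redefined locally under hypothetical SK_ names so the
 * sketch builds without <windows.h>. Values match <winerror.h>. */
#define SK_ERROR_FILE_NOT_FOUND     2UL
#define SK_ERROR_NO_MORE_FILES      18UL
#define SK_ERROR_SHARING_VIOLATION  32UL
#define SK_ERROR_LOCK_VIOLATION     33UL

struct sk_errmap
{
    unsigned long winerr;   /* value returned by GetLastError() */
    int           doserr;   /* errno value to report instead */
};

/* A small excerpt of the mapping; the real table is much longer. */
static const struct sk_errmap sk_errmap_table[] = {
    {SK_ERROR_FILE_NOT_FOUND, ENOENT},
    {SK_ERROR_NO_MORE_FILES, ENOENT},
    {SK_ERROR_SHARING_VIOLATION, EACCES},   /* the entry added by the patch */
    {SK_ERROR_LOCK_VIOLATION, EACCES},
};

/* Return the errno value for a Win32 error code.
 * Unknown codes fall back to EINVAL, which is why an unmapped
 * sharing violation shows up as "invalid parameter". */
int
sk_map_win32_error(unsigned long winerr)
{
    size_t i;
    size_t n = sizeof(sk_errmap_table) / sizeof(sk_errmap_table[0]);

    for (i = 0; i < n; i++)
    {
        if (sk_errmap_table[i].winerr == winerr)
            return sk_errmap_table[i].doserr;
    }
    return EINVAL;
}
```

With the sharing-violation entry present, code 32 is reported as EACCES ("permission denied") rather than falling through to EINVAL.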

#2Magnus Hagander
magnus@hagander.net
In reply to: Qingqing Zhou (#1)
Re: win32 _dosmaperr()

> There were several reports of "unable to read/write" on
> the Pg8.0.x win32 port:
>
> http://archives.postgresql.org/pgsql-bugs/2005-02/msg00181.php
>
> I encountered this several times and finally caught the
> GetLastError() number. It is
>
> 32, ERROR_SHARING_VIOLATION
> The process cannot access the file because it is being
> used by another process.
>
> But the PG server error message is "invalid parameter", which
> makes this error difficult to understand and track. After
> examining the win32 CRT's _dosmaperr() implementation, I found
> that it fails to translate ERROR_SHARING_VIOLATION, so errno is
> set to the default, EINVAL. To solve this, we can call our own
> _dosmaperr(GetLastError()) again if a read/write fails.
> Unfortunately our _dosmaperr() fails to translate it either, so
> here is a patch to error.c. Also, I raised the error level to
> NOTICE for better bug reports. If this is acceptable, I will
> patch FileRead()/FileWrite() etc.

Seems reasonable.

> However, I am not very sure why this could happen. That is, who
> opens the data file in a non-sharing mode? There are many
> possibilities; the common consensus is [anti-]virus software.
> Yes, I do have one installed. If we can confirm this, then we
> could at least print a hint message.

I would suspect either AV software and/or backup software not excluding
the pg data files.

I suggest you try using Process Explorer from www.sysinternals.com to
figure out who has the file open. Most of the time it should be able to
tell you exactly who has locked the file - at least as long as it's done
from userspace. I'm not 100% sure on how it deals with kernel level
locks.

//Magnus

#3Qingqing Zhou
zhouqq@cs.toronto.edu
In reply to: Magnus Hagander (#2)
Re: win32 _dosmaperr()

""Magnus Hagander"" <mha@sollentuna.net> writes

> I suggest you try using Process Explorer from www.sysinternals.com to
> figure out who has the file open. Most of the time it should be able to
> tell you exactly who has locked the file - at least as long as it's done
> from userspace. I'm not 100% sure on how it deals with kernel level
> locks.

Yes, "handle" (also on the Sysinternals site) might be an alternative. I've
added a call to "handle" to capture all the open handles under DataDir when the
error is trapped.

Regards,
Qingqing

#4Merlin Moncure
merlin.moncure@rcsonline.com
In reply to: Qingqing Zhou (#3)
Re: win32 _dosmaperr()

Qingqing wrote:

> There were several reports of "unable to read/write" on the Pg8.0.x win32
> port:
>
> http://archives.postgresql.org/pgsql-bugs/2005-02/msg00181.php
>
> I encountered this several times and finally caught the GetLastError()
> number. It is
>
> 32, ERROR_SHARING_VIOLATION
> The process cannot access the file because it is being used by another
> process.
>
> But the PG server error message is "invalid parameter", which makes this
> error difficult to understand and track. After examining the win32 CRT's
> _dosmaperr() implementation, I found that it fails to translate
> ERROR_SHARING_VIOLATION, so errno is set to the default, EINVAL. To solve
> this, we can call our own _dosmaperr(GetLastError()) again if a read/write
> fails. Unfortunately our _dosmaperr() fails to translate it either, so here
> is a patch to error.c. Also, I raised the error level to NOTICE for better
> bug reports. If this is acceptable, I will patch FileRead()/FileWrite() etc.
>
> However, I am not very sure why this could happen. That is, who opens the
> data file in a non-sharing mode? There are many possibilities; the common
> consensus is [anti-]virus software. Yes, I do have one installed. If we can
> confirm this, then we could at least print a hint message.

I had similar problems since the early days of the win32 port, random
restarts of the stats collector and other unexplainable things. This
only ever happened under heavy loads (1000+/sec sustained query
processing) with statement level stats on. This played havoc with my
user diagnostic tools because it randomly restarted the stats collector
so I've had to keep row level stats off.

Merlin

#5Qingqing Zhou
zhouqq@cs.toronto.edu
In reply to: Magnus Hagander (#2)
Re: win32 _dosmaperr()

""Magnus Hagander"" <mha@sollentuna.net> writes

> I suggest you try using Process Explorer from www.sysinternals.com to
> figure out who has the file open. Most of the time it should be able to
> tell you exactly who has locked the file - at least as long as it's done
> from userspace. I'm not 100% sure on how it deals with kernel level
> locks.

After running the PG win32 (8.0.1) server for a while and mixing in some heavy
operations like checkpoint and vacuum, I encountered another problem that
should be in the same category. PG reports:

"could not unlink 0000xxxx, continuing to try"

at dirmod.c/pgunlink() and loops there forever. Using the PE tool you mentioned,
I found that only 3 processes hold a handle to the problematic xlog
segment, and all of them are postgres backends. Using the FileMon tool from the
same website, I found that the bgwriter tried to OPEN the xlog segment with ALL
ACCESS but failed with the result DELETE PEND.

That is to say, under some conditions, even if I opened the file with the
FILE_SHARE_DELETE flag, I may not be able to remove the file while it is open?
I ran some tests, but every time I deleted or renamed an open file, it succeeded.

Things could get worse because the whole database cluster may stop working,
waiting for the buffer the bgwriter is working on, while the bgwriter is
waiting (in the endless loop in pgunlink) for those postgres backends to move
on (so that they can close the problematic xlog segment), which is a deadlock.
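
To make the "continuing to try" behaviour concrete, here is a rough sketch of a retry loop around file removal that gives up after a fixed number of attempts instead of spinning forever. This is not PostgreSQL's pgunlink(): the function name, the `max_tries` limit and the `wait_hook` callback are all hypothetical, and it uses the standard C remove() so that it builds on any platform.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical bounded variant of an unlink-with-retry loop.
 *
 * Tries to remove "path" up to max_tries times. Between failed attempts it
 * calls wait_hook (if not NULL), where a caller could sleep or log a
 * "continuing to try" message. Returns 0 on success and -1 on failure,
 * with errno describing the last failed attempt. */
int
sk_unlink_with_retry(const char *path, int max_tries,
                     void (*wait_hook) (int attempt))
{
    int attempt;
    int saved_errno = EINVAL;   /* reported if max_tries <= 0 */

    for (attempt = 1; attempt <= max_tries; attempt++)
    {
        if (remove(path) == 0)
            return 0;

        saved_errno = errno;
        if (saved_errno == ENOENT)
            break;              /* file is already gone; retrying cannot help */

        if (wait_hook != NULL && attempt < max_tries)
            wait_hook(attempt);
    }

    errno = saved_errno;        /* the hook may have clobbered errno */
    return -1;
}
```

A caller that hits the limit can then report the failure and move on, rather than blocking every process that depends on it.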

Regards,
Qingqing

#6Bruce Momjian
bruce@momjian.us
In reply to: Qingqing Zhou (#5)
Re: win32 _dosmaperr()

Interesting. Are you sure all those processes were using our standard
flags? Seems unusual and you are right, it shouldn't be happening.

---------------------------------------------------------------------------

Qingqing Zhou wrote:

> ""Magnus Hagander"" <mha@sollentuna.net> writes
>
>> I suggest you try using Process Explorer from www.sysinternals.com to
>> figure out who has the file open. Most of the time it should be able to
>> tell you exactly who has locked the file - at least as long as it's done
>> from userspace. I'm not 100% sure on how it deals with kernel level
>> locks.
>
> After running the PG win32 (8.0.1) server for a while and mixing in some heavy
> operations like checkpoint and vacuum, I encountered another problem that
> should be in the same category. PG reports:
>
> "could not unlink 0000xxxx, continuing to try"
>
> at dirmod.c/pgunlink() and loops there forever. Using the PE tool you mentioned,
> I found that only 3 processes hold a handle to the problematic xlog
> segment, and all of them are postgres backends. Using the FileMon tool from the
> same website, I found that the bgwriter tried to OPEN the xlog segment with ALL
> ACCESS but failed with the result DELETE PEND.
>
> That is to say, under some conditions, even if I opened the file with the
> FILE_SHARE_DELETE flag, I may not be able to remove the file while it is open?
> I ran some tests, but every time I deleted or renamed an open file, it succeeded.
>
> Things could get worse because the whole database cluster may stop working,
> waiting for the buffer the bgwriter is working on, while the bgwriter is
> waiting (in the endless loop in pgunlink) for those postgres backends to move
> on (so that they can close the problematic xlog segment), which is a deadlock.
>
> Regards,
> Qingqing


-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#6)
Re: win32 _dosmaperr()

Bruce Momjian <pgman@candle.pha.pa.us> writes:

> Qingqing Zhou wrote:
>
>> Things could get worse because the whole database cluster may stop working,
>> waiting for the buffer the bgwriter is working on, while the bgwriter is
>> waiting (in the endless loop in pgunlink) for those postgres backends to move
>> on (so that they can close the problematic xlog segment), which is a deadlock.

I think that analysis is bogus. The bgwriter only tries to unlink xlog
segments during post-checkpoint cleanup, at which point it isn't holding
any buffer locks. Likewise, while backends might wait trying to remove
a table file because the bgwriter has the file open, in that state they
aren't blocking the bgwriter either.

In the latter case, the backends will have to wait till the bgwriter
closes the file, which it'll do not later than the next checkpoint.
I wonder whether the complaints are coming from people who don't know
about that, and didn't wait long enough?

There could be a deadlock if a backend is holding open an old xlog
segment while it executes a CHECKPOINT command, because then it'll
wait for the bgwriter, and the bgwriter might think it could remove
the xlog file during the checkpoint.

Another form could only happen between two backends: A is trying to
unlink file F, which backend B has open, and then for some unrelated
reason B has to wait for a lock held by A. The bgwriter doesn't take
nor wait for locks so this doesn't apply to it.
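
The two-backend case above can be illustrated with a toy wait-for check. This is only a sketch under simplifying assumptions, not anything in the PostgreSQL lock manager: each process is assumed to wait on at most one other process, and the function and array names are hypothetical.

```c
#include <stdbool.h>

/* waits_for[i] is the index of the process that process i is blocked on,
 * or -1 if process i is not waiting for anyone.
 *
 * Returns true if "start" is itself part of a wait cycle, i.e. following
 * the chain of waits from "start" leads back to "start". */
bool
sk_in_wait_cycle(const int waits_for[], int nprocs, int start)
{
    int cur = start;
    int steps;

    for (steps = 0; steps < nprocs; steps++)
    {
        cur = waits_for[cur];
        if (cur < 0 || cur >= nprocs)
            return false;       /* chain ends at a runnable process */
        if (cur == start)
            return true;        /* came back around: deadlock */
    }
    return false;               /* blocked behind others, but not in a cycle */
}
```

With A = 0 and B = 1, the scenario "A waits for B to close F, B waits for a lock held by A" is `{1, 0}`, which is a cycle. A backend waiting on a process that never waits on anyone else, such as `{-1, 0}`, is not.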

But none of this should be happening because we're supposedly always
opening all these files with the magic sharing flag.

regards, tom lane