Overflow of bgwriter's request queue
Hi Hackers,
I encountered an overflow of the bgwriter's file-fsync request queue. It
occurred during checkpoints. When the queue overflows, each backend falls
back to issuing its own fsync calls, in no particular order, so the
checkpoint takes a long time and performance degrades. It seems to happen
frequently on machines with a lot of memory and slow disks.
I believe the cause of this problem is that AbsorbFsyncRequests is not
called for a long time during checkpoints. The attached patch is one
possible solution: when the queue is full, it eliminates duplicate
requests with a simple sort-and-unique technique.
I hope this problem can be solved by some such method.
---
ITAGAKI Takahiro
NTT Cyber Space Laboratories
Attachment: bgwriter-requests-queue-overflow.patch (application/octet-stream)
diff -cpr pgsql-orig/src/backend/postmaster/bgwriter.c pgsql/src/backend/postmaster/bgwriter.c
*** pgsql-orig/src/backend/postmaster/bgwriter.c 2006-01-10 22:03:59.000000000 +0900
--- pgsql/src/backend/postmaster/bgwriter.c 2006-01-10 22:19:50.000000000 +0900
*************** static void bg_quickdie(SIGNAL_ARGS);
*** 150,155 ****
--- 150,159 ----
static void BgSigHupHandler(SIGNAL_ARGS);
static void ReqCheckpointHandler(SIGNAL_ARGS);
static void ReqShutdownHandler(SIGNAL_ARGS);
+ static int EliminateDupRequests(BgWriterRequest *requests, int num);
+ static int BgWriterRequestCompare(const void *v1, const void *v2);
+ static size_t unique(void *base, size_t nmemb, size_t size,
+ int (*compar)(const void *, const void *));
/*
*************** RequestCheckpoint(bool waitforit, bool w
*** 628,639 ****
*
* If we are unable to pass over the request (at present, this can happen
* if the shared memory queue is full), we return false. That forces
! * the backend to do its own fsync. We hope that will be even more seldom.
*
* Note: we presently make no attempt to eliminate duplicate requests
! * in the requests[] queue. The bgwriter will have to eliminate dups
! * internally anyway, so we may as well avoid holding the lock longer
! * than we have to here.
*/
bool
ForwardFsyncRequest(RelFileNode rnode, BlockNumber segno)
--- 632,644 ----
*
* If we are unable to pass over the request (at present, this can happen
* if the shared memory queue is full), we return false. That forces
! * the backend to do its own fsync. We hope that will be even more seldom,
! * but in such cases, we will eliminate duplicate requests.
*
* Note: we presently make no attempt to eliminate duplicate requests
! * in the requests[] queue as long as it is not full. The bgwriter will
! * have to eliminate dups internally anyway, so we may as well avoid
! * holding the lock longer than we have to here.
*/
bool
ForwardFsyncRequest(RelFileNode rnode, BlockNumber segno)
*************** ForwardFsyncRequest(RelFileNode rnode, B
*** 645,656 ****
Assert(BgWriterShmem != NULL);
LWLockAcquire(BgWriterCommLock, LW_EXCLUSIVE);
! if (BgWriterShmem->bgwriter_pid == 0 ||
! BgWriterShmem->num_requests >= BgWriterShmem->max_requests)
{
LWLockRelease(BgWriterCommLock);
return false;
}
request = &BgWriterShmem->requests[BgWriterShmem->num_requests++];
request->rnode = rnode;
request->segno = segno;
--- 650,670 ----
Assert(BgWriterShmem != NULL);
LWLockAcquire(BgWriterCommLock, LW_EXCLUSIVE);
! if (BgWriterShmem->bgwriter_pid == 0)
{
LWLockRelease(BgWriterCommLock);
return false;
}
+ if (BgWriterShmem->num_requests >= BgWriterShmem->max_requests)
+ {
+ BgWriterShmem->num_requests = EliminateDupRequests(
+ BgWriterShmem->requests, BgWriterShmem->num_requests);
+ if (BgWriterShmem->num_requests >= BgWriterShmem->max_requests)
+ {
+ LWLockRelease(BgWriterCommLock);
+ return false;
+ }
+ }
request = &BgWriterShmem->requests[BgWriterShmem->num_requests++];
request->rnode = rnode;
request->segno = segno;
*************** AbsorbFsyncRequests(void)
*** 710,712 ****
--- 724,778 ----
END_CRIT_SECTION();
}
+
+ static int
+ BgWriterRequestCompare(const void *v1, const void *v2)
+ {
+ const BgWriterRequest *lhs = (const BgWriterRequest *)v1;
+ const BgWriterRequest *rhs = (const BgWriterRequest *)v2;
+
+ if (lhs->rnode.spcNode < rhs->rnode.spcNode) return -1;
+ if (lhs->rnode.spcNode > rhs->rnode.spcNode) return 1;
+ if (lhs->rnode.dbNode < rhs->rnode.dbNode) return -1;
+ if (lhs->rnode.dbNode > rhs->rnode.dbNode) return 1;
+ if (lhs->rnode.relNode < rhs->rnode.relNode) return -1;
+ if (lhs->rnode.relNode > rhs->rnode.relNode) return 1;
+ if (lhs->segno < rhs->segno) return -1;
+ if (lhs->segno > rhs->segno) return 1;
+ return 0;
+ }
+
+ static size_t
+ unique(void *base, size_t nmemb, size_t size,
+ int (*compar)(const void *, const void *))
+ {
+ char *start = (char *) base;
+ char *stop = (char *) base + (size * (nmemb - 1));
+ char *dst = start;
+ char *p = start;
+
+ while (p <= stop)
+ {
+ size_t elements;
+ char *q = p;
+
+ while (q < stop && compar(q, q + size) != 0)
+ q = q + size;
+
+ elements = ((q - p) / size) + 1;
+ memmove(dst, p, size * elements);
+ dst = dst + size * elements;
+
+ p = q + size;
+ while (p <= stop && compar(q, p) == 0)
+ p = p + size;
+ }
+ return (dst - start) / size;
+ }
+
+ static int
+ EliminateDupRequests(BgWriterRequest *requests, int num)
+ {
+ qsort(requests, num, sizeof(BgWriterRequest), BgWriterRequestCompare);
+ return unique(requests, num, sizeof(BgWriterRequest), BgWriterRequestCompare);
+ }
ITAGAKI Takahiro <itagaki.takahiro@lab.ntt.co.jp> writes:
I encountered an overflow of the bgwriter's file-fsync request queue. It
occurred during checkpoints. When the queue overflows, each backend falls
back to issuing its own fsync calls, in no particular order, so the
checkpoint takes a long time and performance degrades. It seems to happen
frequently on machines with a lot of memory and slow disks.
I can't help thinking that this is a situation that could only be got
into with a seriously misconfigured database --- per the comments for
ForwardFsyncRequest, we really don't want this code to run at all,
let alone run so often that a queue with NBuffers entries overflows.
What exactly are the test conditions under which you're seeing this
happen?
If there actually is a problem that needs to be solved, I think it'd be
better to try to do AbsorbFsyncRequests somewhere in the main checkpoint
loops. I don't like the idea of holding the BgWriterCommLock long
enough to do a qsort ... especially not if this occurs only with very
large NBuffers settings. Also, what if the qsort fails to eliminate any
duplicates, or eliminates only a few? You could get into a scenario
where the qsort gets repeated every few ForwardFsyncRequest calls, in
which case it'd become a drag on performance itself. (See also recent
discussion with Qingqing about converting BgWriterCommLock to a
spinlock. Though I was against that because no performance problem had
been shown, it could still become something we want to do ... but
putting a qsort here would foreclose that option.)
regards, tom lane
I'm sorry if you have received duplicate copies of this mail. I sent it
earlier but it seemed not to be delivered, so I am sending it again.
Tom Lane <tgl@sss.pgh.pa.us> wrote:
I encountered overflow of bgwriter's file-fsync request queue.
I can't help thinking that this is a situation that could only be got
into with a seriously misconfigured database --- per the comments for
ForwardFsyncRequest, we really don't want this code to run at all,
let alone run so often that a queue with NBuffers entries overflows.
What exactly are the test conditions under which you're seeing this
happen?
It happened in two environments:
[1]: TPC-C (DBT-2) / RHEL4 U1 (2.6.9-11) / XFS, 8 S-ATA disks / 8GB memory (shmem=512MB)
[2]: TPC-C (DBT-2) / RHEL4 U2 (2.6.9-22) / XFS, 6 SCSI disks / 6GB memory (shmem=1GB)
I don't think that is such a bad configuration. There seems to be a
problem with the combination of XFS and heavy-update workloads, but the
total throughput on XFS with my patch was better than on ext3.
I suspect that NBuffers is not a large enough queue length. If all buffers
are dirty, ForwardFsyncRequest can be called more than NBuffers times
during BufferSync, so the queue can fill up.
If there actually is a problem that needs to be solved, I think it'd be
better to try to do AbsorbFsyncRequests somewhere in the main checkpoint
loops. I don't like the idea of holding the BgWriterCommLock long
enough to do a qsort ... especially not if this occurs only with very
large NBuffers settings.
OK, I agree. I have sent a patch to -patches that calls AbsorbFsyncRequests
inside the loops of BufferSync and mdsync.
Also, what if the qsort fails to eliminate any
duplicates, or eliminates only a few? You could get into a scenario
where the qsort gets repeated every few ForwardFsyncRequest calls, in
which case it'd become a drag on performance itself.
I now think the above solution is better than the qsort approach, but
qsort would not perform too badly either. NBuffers is at least in the
thousands, while the number of files that need fsync is at most in the
hundreds, so duplicate elimination works well. In fact, on my machine the
queue became full twice during a checkpoint, and duplicate elimination
reduced the queue length from 65536 to *32*.
---
ITAGAKI Takahiro
NTT Cyber Space Laboratories