COPY locking
What mode of locking is associated with a COPY to the database?
Is it a RowExclusiveLock? And since it is brand new data, should there
be no conflicts?
I am trying to figure out what is slowing down multiple concurrent COPY
commands to my database and all I see is lots of semops. CPU usage,
disk access, etc. are not pegged, so what else could be slowing me down?
--
John Coers Intrinsity, Inc.
coers@intrinsity.com Austin, Texas
John Coers <coers@intrinsity.com> writes:
I am trying to figure out what is slowing down multiple concurrent COPY
commands to my database and all I see is lots of semops.
Are you talking about concurrent copies into the same table? That would
suffer a lot of short-term lock interference, no doubt, since all the
copies are going to be writing the same disk page (ie, the current last
page of the table).
regards, tom lane
Tom Lane wrote:
John Coers <coers@intrinsity.com> writes:
I am trying to figure out what is slowing down multiple concurrent COPY
commands to my database and all I see is lots of semops.
Are you talking about concurrent copies into the same table? That would
suffer a lot of short-term lock interference, no doubt, since all the
copies are going to be writing the same disk page (ie, the current last
page of the table).
regards, tom lane
Yes, that is exactly the problem. So there is not a lock per se? Do they each
write to their own piece of shared memory and then try to flush that to disk, and is that
when they interfere?
Are there any suggested techniques or tweaks I can make to avoid this interference?
The techniques I have tried, in order of my perception of their effectiveness are:
1) set N=2. I have 150-250 client machines each trying to connect and COPY 10k-70k rows
of 3 ints. If a connection can't be made, the client waits and tries again in a few secs.
This has the most positive effect by FAR of anything I've tried. The negative side is that
if someone wants to query, they will hold up the copying of data TO the db by taking up a connection
even though they wouldn't otherwise block.
2) Set up multiple dbs on different disks. I can't see that this has much effect, but maybe a little.
3) Jacking up the shared memory, wal buffers, and running in silent mode.
--
John Coers Intrinsity, Inc.
coers@intrinsity.com Austin, Texas
John Coers <coers@intrinsity.com> writes:
Tom Lane wrote:
Are you talking about concurrent copies into the same table? That would
suffer a lot of short-term lock interference, no doubt, since all the
copies are going to be writing the same disk page (ie, the current last
page of the table).
Yes, that is exactly the problem. So there is not a lock per se? Do
they each write to their own piece of shared memory and then try to
flush that to disk, and is that when they interfere?
No, they all share the same disk buffer in shared memory for whatever is
currently the last page of the file. Adding a tuple into that page
requires grabbing a short-term lock on that page (since obviously only
one process can do it at a time, else they'd be trying to insert into
the same physical spot). That lock is released as soon as the tuple is
physically stored in the buffer, but with enough processes running COPY
into the same table, you'll get contention for the lock.
I suspect the performance issue you're seeing is not so much the need
for a short-term lock (it's awful hard to see how to do without one,
anyway) as it is that our current spinlock implementation is pretty bad
in the face of heavy contention. On most platforms the backoff when
you can't get the lock on the first try is a ten-millisecond sleep,
which is huge overkill for locks that are only held for periods of
microseconds. There was discussion in pghackers a few months ago about
reimplementing spinlocks in some more modern way (eg, using Posix
semaphores if available) but no progress has been made yet.
Are there any suggested techniques or tweaks I can make to avoid this
interference?
Do you really need to run multiple COPYs in parallel, or would
serializing them be just as good? You could serialize them without
locking out readers by doing
BEGIN;
LOCK TABLE foo IN EXCLUSIVE MODE;
COPY foo FROM ...
END;
"EXCLUSIVE" mode isn't quite as exclusive as a plain LOCK TABLE; it
still allows other readers. See "Table-level locks" in the User's
Guide.
regards, tom lane
No, they all share the same disk buffer in shared memory for whatever is
currently the last page of the file. Adding a tuple into that page
requires grabbing a short-term lock on that page (since obviously only
one process can do it at a time, else they'd be trying to insert into
the same physical spot). That lock is released as soon as the tuple is
physically stored in the buffer, but with enough processes running COPY
into the same table, you'll get contention for the lock.
Ah so! So after each tuple is written to shared memory there is a giant
scramble for the lock. Explains all the semops and timer sets and sleeps
in the truss output that I've been seeing.
I suspect the performance issue you're seeing is not so much the need
for a short-term lock (it's awful hard to see how to do without one,
anyway) as it is that our current spinlock implementation is pretty bad
in the face of heavy contention. On most platforms the backoff when
you can't get the lock on the first try is a ten-millisecond sleep,
which is huge overkill for locks that are only held for periods of
microseconds. There was discussion in pghackers a few months ago about
reimplementing spinlocks in some more modern way (eg, using Posix
semaphores if available) but no progress has been made yet.
I saw that thread while looking for answers to my problem. Put me in as a
"Yay" vote. My use of this system is different from most I think. I need to cram
huge amounts of data in from multiple clients, then I'll pull it all out and
process it later although there will be an occasional query to monitor progress.
I don't need fancy queries.
Are there any suggested techniques or tweaks I can make to avoid this
interference?
Do you really need to run multiple COPYs in parallel, or would
serializing them be just as good? You could serialize them without
locking out readers by doing
BEGIN;
LOCK TABLE foo IN EXCLUSIVE MODE;
COPY foo FROM ...
END;
"EXCLUSIVE" mode isn't quite as exclusive as a plain LOCK TABLE; it
still allows other readers. See "Table-level locks" in the User's
Guide.
They don't have to be parallel in the strictest sense. As clients get
finished with jobs they will try to connect and upload data. If serializing
the process makes it faster, then that's what I'll try. Counterintuitive though
it may be. Of course getting my best performance with N=2 was counterintuitive too...
--
John Coers Intrinsity, Inc.
coers@intrinsity.com Austin, Texas
John Coers <coers@intrinsity.com> writes:
Ah so! So after each tuple is written to shared memory there is a giant
scramble for the lock. Explains all the semops and timer sets and sleeps
in the truss output that I've been seeing.
Hmm ... semops? What platform are you on, anyway, and which PG version?
It sounds like you might be running with the fallback SysV-semaphore-
based implementation of spinlocks, which is orders of magnitude more
horrid than even the normal test-and-set-based implementation. If there
is TAS support for your machine then I'd only expect to see select(2)
calls during spinlock operations. Does your version of src/include/os.h
define HAS_TEST_AND_SET?
regards, tom lane
Tom Lane wrote:
John Coers <coers@intrinsity.com> writes:
Ah so! So after each tuple is written to shared memory there is a giant
scramble for the lock. Explains all the semops and timer sets and sleeps
in the truss output that I've been seeing.
Hmm ... semops? What platform are you on, anyway, and which PG version?
It sounds like you might be running with the fallback SysV-semaphore-
based implementation of spinlocks, which is orders of magnitude more
horrid than even the normal test-and-set-based implementation. If there
is TAS support for your machine then I'd only expect to see select(2)
calls during spinlock operations. Does your version of src/include/os.h
define HAS_TEST_AND_SET?
regards, tom lane
Solaris 5.7, postgres 7.1
I see this in src/include/os.h:
#elif defined(__sparc__)
typedef unsigned char slock_t;
#define HAS_TEST_AND_SET
I've attached a little fascinating truss output. 7687 is the postmaster and the
rest are backends.
--
John Coers Intrinsity, Inc.
coers@intrinsity.com Austin, Texas
Attachments: tmp (text/plain; charset=us-ascii)
Hmm ... semops? What platform are you on, anyway, and which
PG version?
It sounds like you might be running with the fallback SysV-semaphore-
based implementation of spinlocks, which is orders of magnitude more
access/heap/hio.c:RelationGetBufferForRelation() uses LockPage
(ie lmgr -> semops) to synchronize table extending. Probably we could
optimize this somehow, but allocation of a new page in bufmgr is
horrible, and that's why we have locks in hio.c from the beginning.
Vadim
John Coers <coers@intrinsity.com> writes:
I've attached a little fascinating truss output.
You probably shouldn't have sent 250k of trace to the whole mailing
list. But it *is* fascinating.
I tried to duplicate the result locally with no success. For a
10000-row COPY FROM STDIN (using the tenk1 table from the regression
database, which has several indexes), my trace contains:
lseek: 14060
read: 4261
write: 4967
recv: 43
other: 170
with a large part of the "other" being process startup/shutdown
overhead. There are no semops or setitimers at all in my trace.
There do seem to be some excess lseeks, but I don't see lots and
lots of seeks with no read or write as you show.
Could we see the schema for the table you are using?
("pg_dump -s -t tablename dbname" is the best way to show it.)
regards, tom lane
Tom Lane wrote:
John Coers <coers@intrinsity.com> writes:
I've attached a little facsinating truss output.
You probably shouldn't have sent 250k of trace to the whole mailing
list. But it *is* fascinating.
I thought about that right as I hit "Send." Apologies. At least it wasn't a Meg. :D
I tried to duplicate the result locally with no success. For a
10000-row COPY FROM STDIN (using the tenk1 table from the regression
database, which has several indexes), my trace contains:
I don't think the semops occur when only 1 client is performing a COPY,
so it makes sense you got different results.
I am doing this with a C executable. The code is pretty ugly because I've been trying
all sorts of different (mostly stupid) things to see what affected performance. Here
is an outline:
1) Try to connect. If there are already max_connections connected, I wait a second (or
a small rand() number of seconds) and try again.
2) do an insert of 10 rows into the 'tests' table and hold onto the OID from each insert
3) query the 'monlibs' table to see if the db knows the code I am about to send it.
If it doesn't, then I COPY that (small) amount of data to table 'mongrids' and insert a
signature of 3 strings into 'monlibs'.
Here is the part that beats on the server:
4) I then COPY up to 7500 rows of data of 3 ints for each of the 10 test oids to the 'moncoverage' table.
The number of rows is weighted oddly and averages 16000 rows total, but could obviously go up to 75000.
There are 150 clients doing this every 5-15 minutes depending on how many rows of data are generated.
At some point I have 250-300 machines doing it.
lseek: 14060
read: 4261
write: 4967
recv: 43
other: 170
with a large part of the "other" being process startup/shutdown
overhead. There are no semops or setitimers at all in my trace.
There do seem to be some excess lseeks, but I don't see lots and
lots of seeks with no read or write as you show.
Could we see the schema for the table you are using?
("pg_dump -s -t tablename dbname" is the best way to show it.)
Attached. It is less than 250k :D
I have tried the Exclusive Lock idea you sent earlier and it seems to
help significantly. I plan on trying that in combination with 8 dbs
on two disks and have each client randomly pick one. Is there a query
to see whether a table is locked?
--
John Coers Intrinsity, Inc.
coers@intrinsity.com Austin, Texas
Attachments: pg7.dump (text/plain; charset=us-ascii)
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:
access/heap/hio.c:RelationGetBufferForRelation() uses LockPage
(ie lmgr -> semops) to synchronize table extending.
But no semop should occur unless somebody is actually blocking on
the lock. John's trace only showed one active backend, so I figured
that there shouldn't be any blockage.
Probably we could
optimize this somehow, but allocation of new page in bufmgr is
horrible and that's why we have locks in hio.c from the beginning.
See later message about eliminating lseeks --- I think we should be
able to avoid doing this lock for every single tuple, as it does now,
and only do it when we need to allocate a new page.
regards, tom lane
Probably we could
optimize this somehow, but allocation of new page in bufmgr is
horrible and that's why we have locks in hio.c from the beginning.
See later message about eliminating lseeks --- I think we should be
able to avoid doing this lock for every single tuple, as it does now,
and only do it when we need to allocate a new page.
Please test current/your hio.c versions when multiple users
insert records into the same table.
Vadim
Tom Lane wrote:
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:
access/heap/hio.c:RelationGetBufferForRelation() uses LockPage
(ie lmgr -> semops) to synchronize table extending.
But no semop should occur unless somebody is actually blocking on
the lock. John's trace only showed one active backend, so I figured
that there shouldn't be any blockage.
Keep in mind that I only sent a sample of trace data to show the kind of
things I was seeing, not the entire process from end-to-end.
--
John Coers Intrinsity, Inc.
coers@intrinsity.com Austin, Texas
I have committed some changes into development sources that should
reduce contention overhead when multiple backends are inserting into the
same table. If you're interested in trying it out, you could pull the
latest sources from our CVS server, or try back-patching the changes
into 7.1.*. The relevant changes are in these files:
2001-05-12 15:58 tgl
* src/: backend/access/heap/heapam.c, backend/access/heap/hio.c,
backend/storage/buffer/bufmgr.c, backend/storage/buffer/localbuf.c,
include/storage/bufmgr.h: Modify RelationGetBufferForTuple() so
that we only do lseek and lock when we need to move to a new page;
as long as we can insert the new tuple on the same page as before,
we only need LockBuffer and not the expensive stuff. Also, twiddle
bufmgr interfaces to avoid redundant lseeks in
RelationGetBufferForTuple and BufferAlloc. Successive inserts now
require one lseek per page added, rather than one per tuple with
several additional ones at each page boundary as happened before.
Lock contention when multiple backends are inserting in same table
is also greatly reduced.
2001-05-10 16:38 tgl
* src/: backend/commands/sequence.c,
backend/storage/buffer/bufmgr.c, backend/storage/smgr/md.c,
backend/storage/smgr/mm.c, backend/storage/smgr/smgr.c,
include/storage/smgr.h: Avoid unnecessary lseek() calls by cleanups
in md.c. mdfd_lstbcnt was not being consulted anywhere, so remove
it and remove the _mdnblocks() calls that were used to set it.
Change smgrextend interface to pass in the target block number (ie,
current file length) --- the caller always knows this already,
having already done smgrnblocks(), so it's silly to do it over
again inside mdextend. Net result: extension of a file now takes
one lseek(SEEK_END) and a write(), not three lseeks and a write.
regards, tom lane
I rebuilt with these changes and am seeing a 30-40% improvement in average performance for
my high-contention high-volume copy case. I still need to do some more testing, but this really
seems to help.
Thanks a lot Tom!
Tom Lane wrote:
I have committed some changes into development sources that should
reduce contention overhead when multiple backends are inserting into the
same table. If you're interested in trying it out, you could pull the
latest sources from our CVS server, or try back-patching the changes
into 7.1.*. The relevant changes are in these files:
2001-05-12 15:58 tgl
* src/: backend/access/heap/heapam.c, backend/access/heap/hio.c,
backend/storage/buffer/bufmgr.c, backend/storage/buffer/localbuf.c,
include/storage/bufmgr.h: Modify RelationGetBufferForTuple() so
that we only do lseek and lock when we need to move to a new page;
as long as we can insert the new tuple on the same page as before,
we only need LockBuffer and not the expensive stuff. Also, twiddle
bufmgr interfaces to avoid redundant lseeks in
RelationGetBufferForTuple and BufferAlloc. Successive inserts now
require one lseek per page added, rather than one per tuple with
several additional ones at each page boundary as happened before.
Lock contention when multiple backends are inserting in same table
is also greatly reduced.
2001-05-10 16:38 tgl
* src/: backend/commands/sequence.c,
backend/storage/buffer/bufmgr.c, backend/storage/smgr/md.c,
backend/storage/smgr/mm.c, backend/storage/smgr/smgr.c,
include/storage/smgr.h: Avoid unnecessary lseek() calls by cleanups
in md.c. mdfd_lstbcnt was not being consulted anywhere, so remove
it and remove the _mdnblocks() calls that were used to set it.
Change smgrextend interface to pass in the target block number (ie,
current file length) --- the caller always knows this already,
having already done smgrnblocks(), so it's silly to do it over
again inside mdextend. Net result: extension of a file now takes
one lseek(SEEK_END) and a write(), not three lseeks and a write.
regards, tom lane
--
John Coers Intrinsity, Inc.
coers@intrinsity.com Austin, Texas