PG_DUMP very slow because of STDOUT ??
Hi,
I am new to the mailing list (as a writer ... I have been reading for a long time), but have been using PostgreSQL in our company for quite some time now.
Lately, we rolled out some new servers with vastly improved hardware (8-disk RAID 1+0 instead of the old 2-disk RAID 1, etc.), a new OS (Ubuntu 10.04 Server instead of Debian Lenny) and a newer PostgreSQL version (8.4.4 instead of 8.3.8). And everything seemed to be much better - as expected - until, well, until I saw the first full backup run-times.
For example:
- One of our old machines needed approximately 2 ½ hours to make a dump (with "-Fc --compress=9") of our approximately 100 GByte DB.
- One of our new machines with the copy of the same 100 GByte DB needed 6 ½ to 7 hours to dump the same DB.
A first check of our MRTG monitoring showed that during the backup the machine was not saturated in any way by load (IO wait was below 5%), and the CPUs were almost idling, so you got the feeling that the machine was sleeping ... But of course it did the backup, only very slowly.
Now I investigated further. I also did some detailed monitoring with collectl (http://collectl.sourceforge.net/), and saw that the machine was often doing some slow reading and every 20-30 seconds wrote a few MBytes to the backup disk. So almost everything suggests that we are very, very far from any hardware limitation. I also checked some file writes/reads on the RAID, and could confirm that a plain dd of a multi-GByte file finished in a snap.
OK, then I checked what pg_dump was doing. In the process list of PostgreSQL (viewed via pgAdmin), I saw how pg_dump works (I hadn't cared much about it until now): it does a COPY to STDOUT. Now I wanted to narrow down the problem, and tried to reproduce that COPY statement on one of our 8 GByte tables (it only has INT, BIGINT and BOOLEAN fields, so no potential BLOB problem!). And now comes the interesting part.
- When I did "COPY public.bigtable (column1, column2, column3) TO '/var/tmp/test.dump';" it was FAST. It wrote the 3 GByte dump file in about 2 ½ minutes (as expected with this hardware)!
- When I did "time sudo -u postgres /var/software/pgsql1/bin/psql -h /tmp/pgsql1 -p 4306 -d database_name -f copy-test.sql > /var/tmp/test.dump" (and copy-test.sql contains: "COPY public.cachereleaseartist (releasebyterritoryid, artistid, isdisplayartist, artistroleclusterid, artistroletypeid, iscontainedinrelease, iscontainedinresource) TO STDOUT;") ... I couldn't wait until it ended ... after 20 minutes the test.dump file was merely at 1 GByte (far from the target size of 3 GByte).
I also monitored both statements via collectl, and could confirm that the direct COPY to a file generated reasonable IO activity, while the COPY via STDOUT was idling like mad, with occasional bursts before falling asleep again. This would also explain why pg_dump is so dog slow: if it gets its data via STDOUT at that speed, then it falls asleep too ...
And to make things even worse: I did the same test on our old hardware too. And BOTH the direct COPY to a file AND the COPY via STDOUT were equally fast (they took 5 ½ minutes ... but OK, that's old iron compared to the new machine)! No difference between file and STDOUT (as expected)!
Now I ask, what's going on here? Why is COPY via STDOUT so much slower on our new machine?
- Has anything changed between PostgreSQL 8.3.8 and 8.4.4 in this respect? Maybe some STDOUT buffer flushing from COPY? Buffer handling? Buffer size?
- Is the OS maybe to blame? As I said, we changed from Debian Lenny (kernel 2.6.26-2) to Ubuntu 10.04 Server (kernel 2.6.32-22), both 64-bit. And if yes, would there be ways for PostgreSQL to adapt to the new OS's behavior regarding STDOUT (if there is such a thing)?
And for the Hardware:
- The old machine was an 8-core with 16 GByte RAM BUT only one (software!) RAID 1 with two disks for everything (I know, that was a little bit meager).
- The new machine has 8 cores (16 with hyper-threading), 24 GByte RAM, a RAID 1+0 with 8 disks for data only, and XLOG/backup/OS on a second RAID 1 with 2 disks, with a hardware RAID controller and battery-backed cache (so, obviously, this machine should be faster than the old one).
Can someone shed some light on this STDOUT madness? Or give me some directions/hints in which I could further research?
Thank you very much!
Andras Fabian
Atrada Trading Network AG
Andras Fabian <Fabian@atrada.net> writes:
> Now I ask, what's going on here? Why is COPY via STDOUT so much slower on our new machine?
Something weird about the network stack on the new machine, maybe.
Have you compared the transfer speeds for Unix-socket and TCP connections?
On a Red Hat box I would try using oprofile to see where the bottleneck
is ... don't know if that's available for Ubuntu.
regards, tom lane
> Something weird about the network stack on the new machine, maybe.
> Have you compared the transfer speeds for Unix-socket and TCP connections?
Hmm, no ... but how do Unix-socket / TCP connections relate to STDOUT here (sorry, maybe this is obvious to some, but not to me at the moment)?
And doing some profiling came to my mind too, but I don't have much experience in that field yet (at least not with basic OS-related stuff). I might check out oprofile (yes, it seems to exist for Ubuntu) next week (now it's time to leave for the weekend :-)
And maybe others will have some more ideas until then ... (or have experienced this or similar weirdness too).
Andras Fabian
-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Friday, 9 July 2010 15:39
To: Andras Fabian
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] PG_DUMP very slow because of STDOUT ??
Andras Fabian <Fabian@atrada.net> writes:
> Now I ask, whats going on here ???? Why is COPY via STDOUT so much slower on out new machine?
Something weird about the network stack on the new machine, maybe.
Have you compared the transfer speeds for Unix-socket and TCP connections?
On a Red Hat box I would try using oprofile to see where the bottleneck
is ... don't know if that's available for Ubuntu.
regards, tom lane
Andras Fabian <Fabian@atrada.net> writes:
>> Something weird about the network stack on the new machine, maybe.
>> Have you compared the transfer speeds for Unix-socket and TCP connections?
> Hmm, no ... but how do Unix-socket / TCP connections relate to STDOUT here (sorry, maybe this is obvious to some, but not to me at the moment)?
Well, COPY TO STDOUT really means "send the data across the network
connection to the client". It sounds to me like your problem is slow
data transfer between the database backend and pg_dump (or psql).
Not sure what could be causing it, unless perhaps you've got a kernel
packet filter that's performing an unreasonable amount of processing
per packet.
regards, tom lane
On 09/07/10 22:08, Tom Lane wrote:
> Andras Fabian <Fabian@atrada.net> writes:
>>> Something weird about the network stack on the new machine, maybe.
>>> Have you compared the transfer speeds for Unix-socket and TCP connections?
>> Hmm, no ... but how do Unix-socket / TCP connections relate to STDOUT here (sorry, maybe this is obvious to some, but not to me at the moment)?
> Well, COPY TO STDOUT really means "send the data across the network
> connection to the client". It sounds to me like your problem is slow
> data transfer between the database backend and pg_dump (or psql).
> Not sure what could be causing it, unless perhaps you've got a kernel
> packet filter that's performing an unreasonable amount of processing
> per packet.
... and that should be showing up as high "system" CPU time, or it seems
to on my machines with lots of tcp/ip traffic and heavy netfilter rules.
If they're using unix sockets packet filtering shouldn't come into play
anyway.
Andras: Are you using UNIX sockets or TCP/IP to connect to PostgreSQL?
If you specify any host to connect to, even "localhost", you're using
TCP/IP. If you don't specify a host, you're using unix sockets.
Whichever you are doing, try the other one and see if performance is
different.
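A hedged sketch of that comparison, reusing the connection details (socket directory /tmp/pgsql1, port 4306, database and table names) from the earlier posts - all of those are placeholders to adjust to your own setup:

```shell
# Time the same COPY over a unix socket and over TCP.
# The names below come from Andras's example, so they are assumptions;
# psql may also simply not be installed on the box running this.
COPY_SQL="COPY public.bigtable TO STDOUT;"

if command -v psql >/dev/null 2>&1; then
    # -h pointing at a socket directory (or no -h at all) => unix socket
    time psql -h /tmp/pgsql1 -p 4306 -d database_name -c "$COPY_SQL" > /dev/null
    # any hostname, even localhost => TCP/IP
    time psql -h 127.0.0.1 -p 4306 -d database_name -c "$COPY_SQL" > /dev/null
    RAN=yes
else
    RAN=no   # psql not installed here
fi
echo "ran=$RAN"
```

If one transport is dramatically slower than the other, that points at the transport rather than at COPY itself.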
Idea: One notable difference between COPY to a file and "COPY TO STDOUT"
to a receiving process over the network is that the latter involves the
interaction of two processes. When a kernel change is involved and
you're seeing low performance with lots of apparent idleness, that
immediately makes me think "process scheduler".
Of course, that might be completely off-track, but I've seen weird
performance issues caused by scheduler changes before.
--
Craig Ringer
Hi guys,
well, definitely, thanks for these first ideas. They at least point in a direction which I hadn't considered before (to be honest, I didn't even think about associating STDOUT with some socket stuff).
With my experiments, I think I have essentially tried both socket types. Once I connected normally via pgAdmin to my server (this would definitely qualify as TCP/IP, I think :-), and then I also connected locally on the machine via "psql" (see the example in my original post), so in that case it must have been unix sockets. And both cases behaved similarly when issuing the COPY TO STDOUT (i.e.: sloooowly).
And as for the difference between COPY TO FILE and COPY TO STDOUT ... when I did the same experiment on the old machine/kernel, there was almost no difference in execution time (almost identical, and fast as expected) - that test was done from "psql"/the command line ... so unix sockets.
So, of course, there might be something with our new kernel. But then many others could have this "problem" too! Because we are using the stock kernel of the latest Ubuntu 10.04 LTS 64-bit server (nothing self-compiled, just as is, as it comes from Canonical after you download it). Maybe someone with an Ubuntu 10.04 (server!) machine and PostgreSQL on it could try to reproduce it too.
By the way, how could I experiment with the scheduler? I remember that there was some magic trick via the /proc file system or something like that to "tweak" it on the fly ...
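For the /proc question, a sketch: on 2.6.32-era kernels the CFS scheduler exposes its knobs as files under /proc/sys/kernel/. The tunable names below are assumptions to verify on your machine - they vary between kernel versions, and newer kernels moved several of them elsewhere:

```shell
# List the scheduler tunables that exist on this kernel.
FOUND=0
for f in /proc/sys/kernel/sched_latency_ns \
         /proc/sys/kernel/sched_min_granularity_ns \
         /proc/sys/kernel/sched_wakeup_granularity_ns \
         /proc/sys/kernel/sched_migration_cost; do
    if [ -r "$f" ]; then
        echo "$f = $(cat "$f")"
        FOUND=$((FOUND + 1))
    fi
done
echo "readable scheduler tunables: $FOUND"

# "Tweaking on the fly" is just a write to the same file, as root, e.g.:
#   echo 10000000 > /proc/sys/kernel/sched_wakeup_granularity_ns
```

Values revert on reboot unless made persistent via sysctl.conf.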
Andras Fabian
Hi Tom (or others),
are there some recommended settings/ways to use oprofile in a situation like this? I got it working and saw a first profile report, but then managed to completely freeze the server on a second try with different oprofile settings (the next tests will go against the newly installed - and identical - new servers).
Andras Fabian
This STDOUT issue gets even weirder. Now I have set up our two new servers (identical hw/sw), as I would have needed to do anyway. After getting PG running, I also set up the same test scenario as on our problematic servers, and started the COPY-to-STDOUT experiment. And you know what? Both new servers are performing well. No hanging, and the 3 GByte test dump was written in around 3 minutes (as expected). To make things even more complicated ... I went back to our production servers. Now the first one - which I froze up with oprofile this morning and which needed a REBOOT - is performing well too! It needed 3 minutes for the test case ... WTF? BUT the second production server, which did not have a reboot, is still behaving badly.
Now I tried to dig deeper (without killing a production server again) ... and came to comparing the outputs of ps (first with the '-fax' parameter, then '-axl'). And I have found something interesting:
- all fast servers show the COPY process as being in the state Rs ("runnable (on run queue)")
- on the still slow server, this process is in 9 out of 10 samples in Ds ("uninterruptible sleep (usually IO)")
Now, this "Ds" state seems to be something unhealthy - especially if it is there almost all the time - as far as my first reads on Google show (and although it points to IO, there is seemingly only very little IO, and IO wait is minimal too). I have also done "-axl" with ps, which brings up the following line for our process:
F UID PID PPID PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND
1 5551 2819 4201 20 0 5941068 201192 conges Ds ? 2:05 postgres: postgres musicload_cache [local] COPY
Now, as far as I understood from my Google searches, the WCHAN column shows where in the kernel my process is hanging. Here it says "conges". Can somebody tell me what "conges" means? Or do I have other options to get more info out of the system (maybe without oprofile - as it already burned my hand :-)?
And yes, now I see a reboot as a possible "fix", but that would not ensure that the problem will not resurface. So, for the time being, I will leave my second production server as is ... so I can further narrow down the potential reasons for this strange STDOUT slowdown (especially if someone has a tip for me :-)
Andras Fabian
(In the meantime my "slow" server finished the COPY ... it took 46 minutes instead of the 3 minutes on the fast machines ... a slowdown of factor 15.)
Andras Fabian wrote:
> - all fast servers show the COPY process as being in the state Rs ("runnable (on run queue)")
> - on the still slow server, this process is in 9 out of 10 samples in Ds ("uninterruptible sleep (usually IO)")
I've run into significant regressions in PostgreSQL performance
due to issues with the Linux scheduler before, specifically
when running a single really intensive client program. You might be
seeing something similar here. I wrote a reference-link-heavy blog
entry about that at
http://notemagnet.blogspot.com/2008/05/pgbench-suffering-with-linux-2623-2626.html
that you might find useful; one of the batch scheduler tweaks alluded to
there might improve things. Regressions here in newer kernels are the
norm rather than the exception, and given the general lack of quality
control in Ubuntu 10.04 I have avoided any performance testing of it
yet. I was going to give it six months after release before I even
thought about that, in hopes more bugs get squashed, but I'm not
optimistic about this distribution for server use at all right now.
There's more information about using oprofile at
http://wiki.postgresql.org/wiki/Profiling_with_OProfile that might help
you dig into the underlying spot it's stuck at.
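For reference, the opcontrol-based session that wiki page describes looks roughly like this (opcontrol was the 2010-era oprofile interface; newer oprofile releases replaced it with operf, so check what your packages actually ship before relying on these exact commands):

```shell
# Rough oprofile session around the slow COPY; requires root.
if command -v opcontrol >/dev/null 2>&1; then
    opcontrol --init
    opcontrol --setup --no-vmlinux   # or --vmlinux=... for kernel symbols
    opcontrol --start
    # ... reproduce the slow COPY TO STDOUT here ...
    opcontrol --dump
    opreport -l | head -n 30         # top symbols by sample count
    opcontrol --shutdown
    PROFILED=yes
else
    PROFILED=no                      # oprofile not installed on this box
fi
echo "profiled=$PROFILED"
```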
--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com www.2ndQuadrant.us
On Mon, Jul 12, 2010 at 7:03 AM, Andras Fabian <Fabian@atrada.net> wrote:
> This STDOUT issue gets even weirder. Now I have set up our two new servers (identical hw/sw) as I would have needed to do so anyways. After having PG running, I also set up the same test scenario as I have it on our problematic servers, and started the COPY-to-STDOUT experiment. And you know what? Both new servers are performing well. No hanging, and the 3 GByte test dump was written in around 3 minutes (as expected). To make things even more complicated ... I went back to our production servers. Now, the first one - which I froze up with oprofile this morning and needed a REBOOT - is performing well too! It needed 3 minutes for the test case ... WTF? BUT, the second production server, which did not have a reboot, is still behaving badly.
I'm gonna take a scientific wild-assed guess that your machine was
rebuilding RAID arrays when you started out, and you had massive IO
contention underneath the OS level resulting in such a slow down.
Note that you mentioned ~5% IO Wait. That's actually fairly high if
you've got 8 to 16 cores or something like that. It's much better to
use iostat -xd 60 or something like that and look for IO Utilization
at the end of the lines.
Again, just a guess.
On 12/07/10 21:03, Andras Fabian wrote:
> This STDOUT issue gets even weirder. Now I have set up our two new servers (identical hw/sw) as I would have needed to do so anyways. After having PG running, I also set up the same test scenario as I have it on our problematic servers, and started the COPY-to-STDOUT experiment. And you know what? Both new servers are performing well. No hanging, and the 3 GByte test dump was written in around 3 minutes (as expected). To make things even more complicated ... I went back to our production servers. Now, the first one - which I froze up with oprofile this morning and needed a REBOOT - is performing well too! It needed 3 minutes for the test case ... WTF? BUT, the second production server, which did not have a reboot, is still behaving badly.
> Now I tried to dig deeper (without killing a production server again) ... and came to comparing the outputs of PS (with '-fax' parameter then, '-axl'). Now I have found something interesting:
> - all fast servers show the COPY process as being in the state Rs ("runnable (on run queue)")
> - on the still slow server, this process is in 9 out of 10 samples in Ds ("uninterruptible sleep (usually IO)")
> Now, this "Ds" state seems to be something unhealthy - especially if it is there almost all the time - as far as my first reads on google show (and although it points to IO, there is seemingly only very little IO, and IO-wait is minimal too). I have also done "-axl" with PS, which brings the following line for our process:
> F UID PID PPID PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND
> 1 5551 2819 4201 20 0 5941068 201192 conges Ds ? 2:05 postgres: postgres musicload_cache [local] COPY
Your wchan column isn't wide enough to show the full function name, but
I'd say it's related to some form of throttling or congestion control.
Get a wider view of that column to find out what the full function name
is. Grepping the kernel source for it can then tell you a lot about
where in the kernel it is and what might be going on.
Try:
ps ax -O wchan:40
to get a decently wide view of that col.
--
Craig Ringer
Tech-related writing: http://soapyfrogs.blogspot.com/
Hi Scott,
Although I can't guarantee 100% that there was no RAID rebuild at some point, I am almost sure that wasn't the case. Two machines - the ones which were already in production - exhibited this problem, and both of them had already been up for some weeks. Also, the reboot rather "fixed" one of them instead of making it worse (as your theory would suggest); the problem "disappeared" (but I don't know for how long). Now only one of the production machines has the issue ... the one which wasn't rebooted. Strange, strange. Nevertheless, thank you for your idea ... this is exactly the way I try to approach the problem: by making some theories and trying to prove or disprove them :-)
Now I will try to further investigate along the tips from Craig and Greg.
Andras Fabian
On Tue, Jul 13, 2010 at 12:31 AM, Andras Fabian <Fabian@atrada.net> wrote:
> Hi Scott,
> Although I can't guarantee for 100% that there was no RAID rebuild at some point, I am almost sure that it wasn't the case. Two machines - the ones which were already in production - exhibited this problem. Both of them were already up for some weeks. Now, the reboot rather "fixed" one of them instead of making it worse (as your theory goes this way) the problem "disappeared" (but I don't know for how long). Now, only one of the production machines has the issue ... the one which wasn't rebooted. Strange, strange. Nevertheless thank you for your idea ... this is exactly the way I try to approach the problem, by making some theories and trying to prove or disapprove them :-)
> Now I will try to further investigate along the tips from Craig and Greg.
Was there maybe a kernel update that hadn't been applied by reboot?
Hi Scott,
No, we didn't have a kernel update (it is still the stock Ubuntu 10.04 Server kernel ... 2.6.32.2). And in the meantime - this morning - I have discovered that the rebooted server is slowing down again! It is not at the level of the non-rebooted server (about 45 mins for the 3 GByte file) ... it "only" needs 22 minutes, but that is already quite a bit away from the optimum of 3 minutes (or less). So, definitely, something is "deteriorating" in the system ...
And I also did some readings with iostat -xd 5 ... The target drive, to which the output of the STDOUT is directed, is below 1% utilization (mostly around 0.2-0.4%), with rare peaks around 2-3% when it does a little bit more. And this is maybe one of the more interesting observations: it seems to periodically "flush" out a bit more, just to fall asleep again (with minimal write activity). The drive from which the reads come (the one with PG's data files ... the 8-disk RAID 10) has a little bit more activity (utilization 6-8%), but that data is also - concurrently - in use by some apps reading from the DB (just normal traffic on the DB).
Andras Fabian
Craig, thanks for that ps tip (you'd think that after using ps for such a long time you'd know all its tricks, but it still has new ones available).
And here is the more readable line:
26390 congestion_wait D ? 00:00:26 postgres: postgres musicload_cache [local] COPY
So the kernel function it is always idling in seems to be congestion_wait ... I am already looking around on Google and in the kernel source (for 2.6.32.2):
http://lxr.linux.no/#linux+v2.6.32.2/mm/backing-dev.c#L762
/**
* congestion_wait - wait for a backing_dev to become uncongested
* @sync: SYNC or ASYNC IO
* @timeout: timeout in jiffies
*
* Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
* write congestion. If no backing_devs are congested then just wait for the
* next write to be completed.
*/
So, obviously, for some reason we are waiting too much for a backing_dev - whichever one it is at the moment. Because, as I just wrote to Scott Marlowe, the disk system is almost idling (I have seen disk utilization on the drive I write to below 1%).
So the question seems to be why and where this "idling" happens.
Just as a test, I have tried a very simple piping example (which should go through STDOUT too ... or am I wrong?):
- "dd if=/dev/zero of=file_10GB bs=1024 count=10M" created a 10 GB test file on source drive (sdb)
- "time cat file_10GB > /var/tmp/test2.dump" ... pipe the file to target drive (/var/tmp is on sda)
This takes only a bit over 2 minutes to complete ... and although one sees congestion_wait in ps from time to time, it is not dominant, and there are other EXT4- or IO-related functions shown. ... But maybe this mini-experiment is lacking some important steps that are in the path of a COPY-to-STDOUT from PostgreSQL.
Andras Fabian
On 13/07/2010 4:05 PM, Andras Fabian wrote:
> Craig, thanks for that PS tip (you think, you have used PS for such a long time, but it still has some new tricks available).
> So, obviously, for some reason we are waiting too much for a backind_device ... which ever it is at the moment. Because, as I just wrote to Scott Marlowe, the disk system is almost idling (have seen disk utilization on the drive to which I write below 1%).
A quick search suggests that most calls into congestion_wait are in the
memory management subsystem, and are involved in situations where the
system is struggling for memory. However, that includes memory consumed
by I/O buffers, writeback for I/O, etc, so it'd also be consistent with
very slow I/O causing write throttling as the system tried to write
already buffered data to disk.
Most other calls are in file system drivers.
At this point I'd be taking a closer look at "vmstat 1" and "iostat 1"
output, plus "top", to see if any interesting clues about system-wide
issues turned up.
I'd also be trying to perform each step of the problem operation in
isolation as much as possible, so as to see if I could find out what
particular part was causing the slowdown. Comparing "\copy" to "COPY ...
TO STDOUT", invoking "COPY ... TO STDOUT" with a standalone backend
writing output to an on disk file and to /dev/null, etc.
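Something along these lines would cover those variants. This is only an untested sketch; the database and table names are the ones mentioned earlier in this thread, so treat them as placeholders for your own:

```shell
#!/bin/sh
# Rough sketch of the isolation runs suggested above.
# DB and TBL are placeholders taken from earlier in this thread.

time_copy() {
    # run one COPY variant and report wall-clock seconds
    desc=$1; shift
    start=$(date +%s)
    "$@"
    rc=$?
    end=$(date +%s)
    echo "$desc: $((end - start))s (exit $rc)"
}

DB=musicload_cache
TBL=public.bigtable

if command -v psql >/dev/null 2>&1; then
    # 1) backend writes straight to a server-side file (fast in the tests above)
    time_copy "server-side file" psql -d "$DB" -c "COPY $TBL TO '/var/tmp/copy_direct.dump';"
    # 2) COPY ... TO STDOUT through the client, redirected to the target disk
    time_copy "stdout to disk" sh -c "psql -d $DB -c 'COPY $TBL TO STDOUT;' > /var/tmp/copy_stdout.dump"
    # 3) same, but to /dev/null, taking the target disk out of the picture
    time_copy "stdout to null" sh -c "psql -d $DB -c 'COPY $TBL TO STDOUT;' > /dev/null"
else
    echo "psql not found; skipping"
fi
```

If variant 3 is as slow as variant 2, the target disk is exonerated and the bottleneck is somewhere between the backend and the client.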
So, the question seems to be, why and where this "idling happens".
You can potentially find out more by getting a trace of the kernel
function call stack for the backend process. The kernel call stack of a
process at any given time can be obtained by reading /proc/$pid/stack .
This will tell you not only what call it's waiting in in the kernel, but
what function(s) called it, and even the code offset within each function.
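A small loop makes that easier to do repeatedly. Untested sketch; 2819 is the backend PID from the ps output quoted earlier, so substitute your own, and note that reading /proc/&lt;pid&gt;/stack usually requires root:

```shell
# Sample the backend's kernel stack a few times to see where it sleeps.
sample_stack() {
    pid=$1
    n=${2:-5}
    i=0
    while [ "$i" -lt "$n" ]; do
        echo "=== sample $i ==="
        cat "/proc/$pid/stack" 2>/dev/null \
            || echo "(unreadable: need root, or PID is gone)"
        sleep 1
        i=$((i + 1))
    done
}

sample_stack 2819 5
```

If the same function shows up at the top of nearly every sample, that is where the process spends its sleep time.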
Just as a test, I have tried a very simple piping example (which should go through STDOUT too ... or am I wrong?).
- "dd if=/dev/zero of=file_10GB bs=1024 count=10M" created a 10 GB test file on the source drive (sdb)
- "time cat file_10GB > /var/tmp/test2.dump" ... pipe the file to the target drive (/var/tmp is on sda)
Isn't your issue suspected to be with network transfers over unix
sockets and/or tcp/ip, rather than with pipes?
Try "socat" if you want to test unix socket performance and/or tcp/ip
socket performance. It's an amazing sysadmin/network swiss army knife.
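For instance, something like this measures raw unix-socket throughput independent of PostgreSQL (untested sketch; the socket path is arbitrary, and socat and dd must be installed):

```shell
# One socat streams /dev/zero into a listening unix socket; the other
# side drains 1 GiB (8192 * 131072 bytes) of it through dd, timed.
SOCK=/tmp/socket_speed_test.sock
if command -v socat >/dev/null 2>&1; then
    rm -f "$SOCK"
    socat -u OPEN:/dev/zero UNIX-LISTEN:"$SOCK" &
    SRV=$!
    sleep 1
    time socat -u UNIX-CONNECT:"$SOCK" STDOUT | dd of=/dev/null bs=8192 count=131072
    kill "$SRV" 2>/dev/null
    rm -f "$SOCK"
else
    echo "socat not installed; skipping"
fi
```

If this is fast on the slow machine while COPY ... TO STDOUT is not, the socket layer itself is probably not the culprit.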
--
Craig Ringer
On 13/07/2010 3:53 PM, Andras Fabian wrote:
Hi Scott,
No, we didn't have a kernel update (it is still the stock Ubuntu 10.04 Server kernel ... 2.6.32.2). And in the meantime - this morning - I have discovered that the rebooted server is slowing down again! It is not at the level of the never-rebooted server (about 45 minutes for the 3 GB file) ... it "only" needs 22 minutes, but that is already quite a bit away from the optimum of 3 minutes (or less). So, definitely, something is "deteriorating" in the system ...
And I also did some readings with iostat -xd 5 ... The target drive, to which the output of the STDOUT is directed, is below 1% utilization (mostly around 0.2-0.4%), with rare peaks around 2-3% when it does a little bit more. And this is maybe one of the more interesting observations: it seems to periodically "flush" a bit more out, just to fall asleep again (with minimal write activity). The drive from which the reads come (the one with PG's data files ... the 8-disk RAID 10) has a little more activity (utilization 6-8%), but that data is also - concurrently - in use by some apps reading from the DB (just normal traffic on the DB).
I don't think it'll be particularly helpful in this case, but you never
know, so: another information-collecting tool is "blktrace". This can
let you observe in detail exactly what's being done on a given block
device. It's helpful when you have weird I/O patterns and can't figure
out why, as it'll often reveal some process continually poking away at
some file it doesn't need to, thrashing away on a disk-backed mmap()ed
tempfile, or the like.
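A minimal sketch of such a capture (untested; it assumes root and the blktrace package, and sda is the drive the dump is written to in this thread, so adjust DEV to your layout):

```shell
# Capture SECS seconds of block-level traffic on DEV, then look at the
# per-process summary blkparse prints at the end of its output.
trace_disk() {
    DEV=${1:-/dev/sda}
    SECS=${2:-30}
    if [ "$(id -u)" -ne 0 ] || ! command -v blktrace >/dev/null 2>&1 || [ ! -b "$DEV" ]; then
        echo "skipping: need root, blktrace, and a block device at $DEV"
        return 0
    fi
    blktrace -d "$DEV" -w "$SECS" -o copytrace &&
        blkparse -i copytrace | tail -n 40
}

# run this while the slow COPY ... TO STDOUT is in progress:
trace_disk /dev/sda 30
```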
--
Craig Ringer
Hi Craig,
Yes, a first look at /proc/pid/stack shows something that smells like memory management ... at least up to the point where congestion_wait is called.
------------------------------------------------------------------
[<ffffffff8110d750>] congestion_wait+0x70/0x90
[<ffffffff81101cb7>] shrink_inactive_list+0x667/0x7e0
[<ffffffff81101ec1>] shrink_list+0x91/0xf0
[<ffffffff811020b7>] shrink_zone+0x197/0x240
[<ffffffff81102886>] __zone_reclaim+0x146/0x260
[<ffffffff81102ab7>] zone_reclaim+0x117/0x150
[<ffffffff810f8fd4>] get_page_from_freelist+0x544/0x6c0
[<ffffffff810f98c9>] __alloc_pages_nodemask+0xd9/0x180
[<ffffffff81131822>] kmalloc_large_node+0x62/0xb0
[<ffffffff81135df9>] __kmalloc_node_track_caller+0x109/0x160
[<ffffffff814665f0>] __alloc_skb+0x80/0x190
[<ffffffff81462504>] sock_alloc_send_pskb+0x1c4/0x320
[<ffffffff81462675>] sock_alloc_send_skb+0x15/0x20
[<ffffffff814f54d5>] unix_stream_sendmsg+0x275/0x3e0
[<ffffffff8145f5ab>] sock_sendmsg+0x10b/0x140
[<ffffffff8145f765>] sys_sendto+0x125/0x180
[<ffffffff810131b2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
------------------------------------------------------------------
I also looked at "what's happening" with strace, where you see traffic like:
------------------------------------------------------------------
....
[ 7fc153d925b2] sendto(8, "\tt\tt\nd\0\0\0\0378241003\t974369\tt\t\\N\t10"..., 8192, 0, NULL, 0) = 8192
[ 7fc153d83460] read(3, "\0\0\0\0\0\0\0\0\0\0\4\0\260\1\260\t\0 \4 \0\0\0\0\310\237d\0\220\237d\0"..., 8192) = 8192
[ 7fc153d83460] read(3, "\0\0\0\0\0\0\0\0\0\0\4\0\260\1\260\t\0 \4 \0\0\0\0\310\237d\0\220\237d\0"..., 8192) = 8192
[ 7fc153d925b2] sendto(8, "0\tt\tt\nd\0\0\0\0378241003\t830278\tt\t\\N\t1"..., 8192, 0, NULL, 0) = 8192
[ 7fc153d83460] read(3, "\0\0\0\0\0\0\0\0\0\0\4\0\260\1\260\t\0 \4 \0\0\0\0\310\237d\0\220\237d\0"..., 8192) = 8192
[ 7fc153d83460] read(3, "\0\0\0\0\0\0\0\0\0\0\4\0\260\1\260\t\0 \4 \0\0\0\0\310\237d\0\220\237d\0"..., 8192) = 8192
[ 7fc153d925b2] sendto(8, "\tt\tt\nd\0\0\0 8241006\t1114684\tt\t\\N\t1"..., 8192, 0, NULL, 0) = 8192
[ 7fc153d83460] read(3, "\0\0\0\0\0\0\0\0\0\0\4\0\260\1\260\t\0 \4 \0\0\0\0\310\237d\0\220\237d\0"..., 8192) = 8192
[ 7fc153d83460] read(3, "\0\0\0\0\0\0\0\0\0\0\4\0\260\1\260\t\0 \4 \0\0\0\0\310\237d\0\220\237d\0"..., 8192) = 8192
[ 7fc153d83460] read(3, "\0\0\0\0\0\0\0\0\0\0\4\0\260\1\260\t\0 \4 \0\0\0\0\310\237d\0\220\237d\0"..., 8192) = 8192
....
------------------------------------------------------------------
But still no definitive clue about the reasons. What is also quite interesting is that when I start my COPY-to-STDOUT experiment, it runs quite fast in the beginning. Sometimes up to 400 MBytes, sometimes up to 1.4 GBytes (I didn't find anything that could be correlated to this) ... and then, suddenly, it begins to stall. From there on, it only advances slowly, with all the congestion_wait going on ... Hmm, maybe it really has something to do with the state of the memory ... (this would fit well with the fact that a freshly rebooted server does not have the problem in the beginning).
Ahh, well, and what would/could go against the file system / drive argument: in my initial tests, where I did a COPY-to-FILE, I never had problems (only COPY-to-STDOUT gives them).
Well, I will try to gather some more information with your other hints (quite a lot of stuff - and one learns a lot of new tricks in the process :-) ...
Andras Fabian
On 13/07/10 17:18, Andras Fabian wrote:
But still no definitive clue about the reasons. What is also quite interesting is that when I start my COPY-to-STDOUT experiment, it runs quite fast in the beginning. Sometimes up to 400 MBytes, sometimes up to 1.4 GBytes (I didn't find anything that could be correlated to this) ... and then, suddenly, it begins to stall. From there on, it only advances slowly, with all the congestion_wait going on ... Hmm, maybe it really has something to do with the state of the memory ... (this would fit well with the fact that a freshly rebooted server does not have the problem in the beginning).
Is there any chance something is buffering the output of the COPY ... TO
STDOUT? Whether it's a pipe/socket buffer that grows to a certain limit,
then begins to be limited by the ability of the receiver to empty it, or
whether it's the receiver buffering the whole COPY in memory before
writing it out, I don't know.
Do you see anything growing in memory use on the system while the COPY
is in progress? Any of the memory use stats in "free -m" changing?
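Logging those numbers over the life of the COPY would make any steady growth visible. An untested sketch (the log path is arbitrary):

```shell
# Sample memory stats every $interval seconds, $n times, so a steadily
# growing buffer (or growing Dirty/Writeback pages) shows up in the log.
watch_mem() {
    n=${1:-5}
    interval=${2:-5}
    i=0
    while [ "$i" -lt "$n" ]; do
        date +%T
        free -m 2>/dev/null | sed -n '2p'            # the "Mem:" line
        grep -E '^(Dirty|Writeback):' /proc/meminfo
        sleep "$interval"
        i=$((i + 1))
    done
}

# during a real COPY run you would use something like: watch_mem 60 5
watch_mem 2 1 > /tmp/mem_during_copy.log
```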
--
Craig Ringer
I had another observation before your last mail:
I have compared the /proc/pid/sched stats of a normal and a slow machine. And there were two counters that really stuck out:
- se.sleep_start
- se.block_start
On a normal machine, both counters remained at 0 the whole time while doing COPY-to-STDOUT. On a "slow" machine, those counters - though always only one of them at a time - showed some really big numbers while the process was stuck in congestion_wait.
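For reference, this is roughly how I pulled those two counters out (2819 is the backend PID from the earlier ps output; the field names can vary between kernel versions, so take this as a sketch):

```shell
# Extract se.sleep_start / se.block_start from /proc/<pid>/sched and
# sample twice, a few seconds apart, to see whether they are changing.
sched_counters() {
    pid=$1
    grep -E 'se\.(sleep|block)_start' "/proc/$pid/sched" 2>/dev/null \
        || echo "no matching counters (wrong PID, or kernel exposes different fields)"
}

sched_counters 2819
sleep 5
sched_counters 2819   # nonzero and changing here matched the slow machines
```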
Andras Fabian