buildfarm: strange OOM failures on markhor (running CLOBBER_CACHE_RECURSIVELY)
Hi all,
a few days ago I set up a buildfarm animal, markhor, running the tests
with CLOBBER_CACHE_RECURSIVELY. As the tests take a very long time to
run, reporting the results back to the server fails because of a
safeguard limit in the buildfarm server. Anyway, that's being discussed
in a different thread - here it's mentioned merely as a 'don't bother
looking for addax on the buildfarm website' warning.
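For context, CLOBBER_CACHE_RECURSIVELY is a compile-time option, not a
GUC; the animal enables it through its build flags. A hand-built
equivalent (illustrative only, not the actual buildfarm client
configuration) would be roughly:

    # build with recursive cache clobbering enabled, then run the tests
    ./configure CPPFLAGS='-DCLOBBER_CACHE_RECURSIVELY'
    make && make check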
I've been checking the progress of the recursive tests today, and I
found it actually failed in the 'make check' step. The logs are
available here:
buildfarm logs: http://www.fuzzy.cz/tmp/buildfarm/recursive-oom.tgz
kernel logs: http://www.fuzzy.cz/tmp/buildfarm/messages
The tests are running within an LXC container (operated through
libvirt), so whenever I say 'VM' I actually mean an LXC container. It
might be some VM/LXC misconfiguration, but as this happens only to a
single VM (the one running the tests with recursive clobber), I find
that unlikely.
================== An example of the failure ==================
parallel group (20 tests): pg_lsn regproc oid name char money float4
txid text int2 varchar int4 float8 boolean int8 uuid rangetypes bit
numeric enum
...
float4 ... ok
float8 ... ok
bit ... FAILED (test process exited with exit code 2)
numeric ... FAILED (test process exited with exit code 2)
txid ... ok
...
===============================================================
and then of course the usual 'terminating connection because of crash of
another server process' warning. Apparently, it's getting killed by the
OOM killer, because it exhausts all the memory assigned to that VM (2GB).
May 15 19:44:53 postgres invoked oom-killer: gfp_mask=0xd0, order=0,
oom_adj=0, oom_score_adj=0
May 15 19:44:53 cspug kernel: postgres cpuset=recursive-builds
mems_allowed=0
May 15 19:44:53 cspug kernel: Pid: 17159, comm: postgres Not tainted
2.6.32-431.17.1.el6.centos.plus.x86_64 #1
AFAIK 2GB is more than enough for a buildfarm machine (after all,
chipmunk has just 512MB). Also, this only happens on this VM
(cpuset=recursive-builds); the other two VMs, with exactly the same
limits and running other buildfarm animals (regular or with
CLOBBER_CACHE_ALWAYS), are perfectly happy - see magpie or markhor for
example. And I don't see any reason why a build with recursive clobber
should require more memory than a regular build. So this seems like a
memory leak somewhere in the cache invalidation code.
I thought it might be fixed by commit b23b0f5588 (Code review for recent
changes in relcache.c), but mite is currently working on
7894ac5004 and yet it already failed on OOM.
The failures apparently happen within a few hours of the test start. For
example on addax (gcc), the build started at 02:50 and the first OOM
failure happened at 05:19; on mite (clang), it's 03:20 vs. 06:50. So
it's roughly 3-4 hours after the tests start.
regards
Tomas
Tomas Vondra <tv@fuzzy.cz> writes:
> ... then of course the usual 'terminating connection because of crash of
> another server process' warning. Apparently, it's getting killed by the
> OOM killer, because it exhausts all the memory assigned to that VM (2GB).
Can you fix things so it runs into its process ulimit before the OOM killer
triggers? Then we'd get a memory map dumped to stderr, which would be
helpful in localizing the problem.
> ... So this seems like a
> memory leak somewhere in the cache invalidation code.
Smells that way to me too, but let's get some more evidence.
regards, tom lane
On 17.5.2014 19:55, Tom Lane wrote:
> Tomas Vondra <tv@fuzzy.cz> writes:
>> ... then of course the usual 'terminating connection because of crash of
>> another server process' warning. Apparently, it's getting killed by the
>> OOM killer, because it exhausts all the memory assigned to that VM (2GB).
>
> Can you fix things so it runs into its process ulimit before the OOM killer
> triggers? Then we'd get a memory map dumped to stderr, which would be
> helpful in localizing the problem.
I did this in /etc/security/limits.d/80-pgbuild.conf:
    pgbuild hard as 1835008
so the user the buildfarm runs under will have up to ~1.75GB of address
space (of the 2GB available to the container).
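(The limits.conf value is in kB, so 1835008 kB is the ~1.75GB mentioned
above. A quick sanity check that the limit is actually picked up -
assuming pam_limits applies to login sessions here - and a rough
per-session equivalent for a manually started postmaster:)

    # should report 1835008 once the new limit is in effect
    su - pgbuild -c 'ulimit -v'

    # per-session equivalent of the limits.conf entry
    ulimit -v 1835008
    pg_ctl -D "$PGDATA" -l postmaster.log start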
>> ... So this seems like a
>> memory leak somewhere in the cache invalidation code.
>
> Smells that way to me too, but let's get some more evidence.
The tests are already running, and there are a few postgres processes:
PID VIRT RES %CPU TIME+ COMMAND
11478 449m 240m 100.0 112:53.57 postgres: pgbuild regression [local]
CREATE VIEW
11423 219m 19m 0.0 0:00.17 postgres: checkpointer process
11424 219m 2880 0.0 0:00.05 postgres: writer process
11425 219m 5920 0.0 0:00.12 postgres: wal writer process
11426 219m 2708 0.0 0:00.05 postgres: autovacuum launcher process
11427 79544 1836 0.0 0:00.17 postgres: stats collector process
11479 1198m 1.0g 0.0 91:09.99 postgres: pgbuild regression [local]
CREATE INDEX waiting
Attached is 'pmap -x' output for the two interesting processes (11478,
11479).
Tomas
On 2014-05-17 20:41:37 +0200, Tomas Vondra wrote:
> On 17.5.2014 19:55, Tom Lane wrote:
>> Tomas Vondra <tv@fuzzy.cz> writes:
>
> The tests are already running, and there are a few postgres processes:
>
> PID VIRT RES %CPU TIME+ COMMAND
> 11478 449m 240m 100.0 112:53.57 postgres: pgbuild regression [local]
> CREATE VIEW
> 11423 219m 19m 0.0 0:00.17 postgres: checkpointer process
> 11424 219m 2880 0.0 0:00.05 postgres: writer process
> 11425 219m 5920 0.0 0:00.12 postgres: wal writer process
> 11426 219m 2708 0.0 0:00.05 postgres: autovacuum launcher process
> 11427 79544 1836 0.0 0:00.17 postgres: stats collector process
> 11479 1198m 1.0g 0.0 91:09.99 postgres: pgbuild regression [local]
> CREATE INDEX waiting
>
> Attached is 'pmap -x' output for the two interesting processes (11478,
> 11479).
Could you gdb -p 11479 into the process and issue 'p
MemoryContextStats(TopMemoryContext)'. That should print information
about the server's allocation to its stderr.
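For reference, such a session would look roughly like this (the PID is
from the listing above; the report ends up wherever the backend's
stderr goes, i.e. the postmaster log for a buildfarm run):

    $ gdb -p 11479
    (gdb) print MemoryContextStats(TopMemoryContext)
    (gdb) detach
    (gdb) quit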
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 17.5.2014 21:31, Andres Freund wrote:
> On 2014-05-17 20:41:37 +0200, Tomas Vondra wrote:
>> On 17.5.2014 19:55, Tom Lane wrote:
>>> Tomas Vondra <tv@fuzzy.cz> writes:
>>
>> The tests are already running, and there are a few postgres processes:
>>
>> PID VIRT RES %CPU TIME+ COMMAND
>> 11478 449m 240m 100.0 112:53.57 postgres: pgbuild regression [local]
>> CREATE VIEW
>> 11423 219m 19m 0.0 0:00.17 postgres: checkpointer process
>> 11424 219m 2880 0.0 0:00.05 postgres: writer process
>> 11425 219m 5920 0.0 0:00.12 postgres: wal writer process
>> 11426 219m 2708 0.0 0:00.05 postgres: autovacuum launcher process
>> 11427 79544 1836 0.0 0:00.17 postgres: stats collector process
>> 11479 1198m 1.0g 0.0 91:09.99 postgres: pgbuild regression [local]
>> CREATE INDEX waiting
>>
>> Attached is 'pmap -x' output for the two interesting processes (11478,
>> 11479).
>
> Could you gdb -p 11479 into the process and issue 'p
> MemoryContextStats(TopMemoryContext)'. That should print information
> about the server's allocation to its stderr.
That process had already finished, but I've done that for another
process (one that had ~400MB allocated and was growing steadily, by
about 1MB every 10 seconds). It was running a SELECT query, which has
since completed.
Then it executed ANALYZE, and I took several snapshots of it - I'm not
sure how much memory it had at the beginning (maybe ~250MB), then when
it had ~350MB, 400MB, 500MB and 600MB. It's still running, so I'm not
sure how much more it will grow.
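(In case someone wants to reproduce this: a batch-mode loop along these
lines - the PID and interval are just placeholders - would take such
snapshots automatically instead of re-attaching by hand:)

    # $PID = pid of the backend being watched; dump its memory context
    # map once a minute; each report goes to that backend's stderr,
    # i.e. the postmaster log
    while sleep 60; do
        gdb --batch -p "$PID" -ex 'print MemoryContextStats(TopMemoryContext)'
    done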
Anyway, the main difference between the ANALYZE snapshots seems to be this:
init: CacheMemoryContext: 67100672 total in 17 blocks; ...
350MB: CacheMemoryContext: 134209536 total in 25 blocks; ...
400MB: CacheMemoryContext: 192929792 total in 32 blocks; ...
500MB: CacheMemoryContext: 293593088 total in 44 blocks; ...
600MB: CacheMemoryContext: 411033600 total in 58 blocks; ...
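(Each step there adds blocks of exactly 8388608 bytes - e.g.
(134209536 - 67100672) / (25 - 17) = 8388608, i.e. 8MB, the allocator's
maximum block size - so it's CacheMemoryContext itself that keeps
growing, not one of its child contexts.)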
Not sure if there's something wrong with the SELECT memory context. It
has ~1500 nested nodes like these:
SQL function data: 24576 total in 2 blocks; ...
ExecutorState: 24576 total in 2 blocks; ...
SQL function data: 24576 total in 2 blocks; ...
ExprContext: 8192 total in 1 blocks; ...
But maybe it's expected / OK.
regards
Tomas
Attachments:
select.log.gz (application/gzip)