When Update balloons memory
About the system:
Ubuntu 20.04, 64 GB RAM, 16 GB shared_buffers, 500 MB work_mem, PostgreSQL 14.1
Core issue:
The following statement, when run across all 800M rows rather than divided into chunks, triggered an OOM kill from the OS.
I have looked into both the kernel logs and the PostgreSQL logs. PostgreSQL just says the backend was killed; the OS killed it because all memory, including swap, was exhausted.
Watching top while the update runs, I can see the RSS column of a single PostgreSQL process (the connection's backend, I assume) just grow and grow until it chokes the system.
Statement:
Update alpha
set x = beta.x
from beta where beta.id = alpha.id and alpha.x <> beta.x
alpha is a wide table (40 columns), partitioned by year into 5 equal partitions; total row count 800M rows.
beta is a 10-column, 40M-row table.
The updated field x is a non-indexed varchar; the id fields are indexed.
There are no triggers.
I am well aware that huge updates have general issues (locking the table, etc.) and are perhaps discouraged, and I did solve the problem by batching it in chunks of 1M rows.
However, my curiosity remains: what is really happening here? Why does PostgreSQL run out of memory, and what exactly is it storing in that memory? I am aware of the work_mem danger, but that is not what is happening here; I can replicate this with 32MB work_mem as well, and this is a low-connection database.
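For reference, the batching workaround can be sketched roughly like this (the slicing by id range and the exact predicate are illustrative assumptions, not necessarily what was actually run):

```sql
-- Illustrative batching sketch: update alpha one ~1M-row id slice at a time,
-- committing between slices so each statement's memory is released.
UPDATE alpha
SET    x = beta.x
FROM   beta
WHERE  beta.id = alpha.id
  AND  alpha.x <> beta.x
  AND  alpha.id BETWEEN 1 AND 1000000;  -- next slice: 1000001..2000000, etc.
```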
Any help is appreciated.
Klaudie
track_activity_query_size = 4096
synchronous_commit = off
full_page_writes = off
#wal_compression = on
wal_level = minimal
max_wal_senders = 0
log_min_duration_statement = 1000
idle_in_transaction_session_timeout = '300s' # 0 disables
tcp_keepalives_idle = '300s'
max_connections = 50
shared_buffers = 16GB
effective_cache_size = 48GB
maintenance_work_mem = 2GB
checkpoint_completion_target = 0.9
min_wal_size = 4GB
max_wal_size = 16GB
#wal_buffers = 16MB
default_statistics_target = 1000
random_page_cost = 1.1
effective_io_concurrency = 200
work_mem = 1000MB
max_worker_processes = 8
max_parallel_workers = 8
max_parallel_workers_per_gather = 4
max_parallel_maintenance_workers = 4
cpu_tuple_cost = 0.03
This is no solution for the issue, but...
On Tue, 7 Dec 2021 at 10:16, Klaudie Willis
<Klaudie.Willis@protonmail.com> wrote:
Ubuntu 20.04, 64GB ram, 16GB shared buffer, 500 MB working mem, Postgresql 14.1
...
shared_buffers = 16GB
effective_cache_size = 48GB
... You are not going to have (total RAM - shared_buffers) available for the
cache; the OS, postgres itself, work_mem, other processes and all sorts of
different things eat RAM. I would suggest looking at free/top/whatever to
size this (it should not cause an OOM, just distort pg's estimates).
Francisco Olarte.
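Francisco's sizing point can be made concrete with a hedged configuration sketch for a 64GB box (the numbers are illustrative assumptions, not recommendations):

```
shared_buffers = 16GB        # fixed shared allocation, resident from startup
work_mem = 32MB              # per sort/hash node, per backend; 50 connections can multiply this
maintenance_work_mem = 1GB   # per maintenance operation (VACUUM, CREATE INDEX)
effective_cache_size = 40GB  # planner estimate only; allocates no memory
```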
Klaudie Willis <Klaudie.Willis@protonmail.com> writes:
The following statement below, when not divided up into chunks, but run across all 800M rows, did trigger an OOM-kill from the OS.
An UPDATE should only result in memory bloat if it's queuing trigger
events to be processed at end-of-statement. You claim there are
no triggers, but are you sure? (what about foreign keys?)
Otherwise, it seems possible that you've identified a memory leak,
but there's not enough detail here to investigate. Can you create
a reproducible test case?
regards, tom lane
Thanks for the insight!
I have recreated the problem on a different machine and installation, where I was freer to experiment and isolate the cause.
So, it seems the index is the central cog here:
create index ind1 on alpha ((deltatime::date));
where "alpha" is a partitioned table set, partitioned by (deltatime::date).
General, simple updates like:
update alphatable set gamma = gamma || '#postfix#'
make the process memory balloon to the point of OOM.
If I remove the ind1 index on "deltatime::date" and instead add one on a random column, the problem disappears. So the index on the partition key seems to be the relevant factor.
Additional info: alphatable has 200M rows distributed evenly across the partitions, and I haven't checked whether the ::date cast is relevant to the problem. No, there are no triggers here, though I can't vouch for what the system creates behind my back.
Is this a feature or a bug?
--
Klaudie
Klaudie Willis <Klaudie.Willis@protonmail.com> writes:
So, it seems like the index is central cog here:
create index ind1 on alpha ((deltatime::date));
where "alpha" is a partition tableset partitioned by (deltatime::date)
The general and simple updates like:
update alphatable set gamma = gamma || '#postfix#'
makes the process memory balloon to the point of OOM.
That seems like a bug, but please supply a self-contained test case
rather than expecting other people to reverse-engineer one.
regards, tom lane
Hi,
Turns out the base case is simpler than I thought; it doesn't involve partitions at all.
CREATE TABLE public.part_main (
txid bigint,
actiondate timestamp without time zone NOT NULL
);
insert into part_main
select x, '2019-06-01'::timestamp + x%365 * interval '1 day'
from generate_series(1, 30 * 1E6) as x;
CREATE INDEX partindx ON public.part_main USING btree ((actiondate)::date); -- mem bug?
-- CREATE INDEX partindx ON public.part_main USING btree (actiondate); -- no bug
-- mem runaway follows
update part_main set txid = txid + 1;
Hope you can replicate it.
best regards
Klaudie
On Tue, 14 Dec 2021 08:16:08 +0000
Klaudie Willis <Klaudie.Willis@protonmail.com> wrote:
CREATE INDEX partindx ON public.part_main USING btree ((actiondate)::date); -- mem bug?
Nope, syntax error
ERROR: syntax error at or near "::"
LINE 1: ...indx_1 ON public.part_main USING btree ((actiondate)::date);
^
-- CREATE INDEX partindx ON public.part_main USING btree (actiondate); -- no bug
-- mem runaway follows
update part_main set txid = txid + 1;
Hope you can replicate it.
Can't replicate on my Intel(R) Core(TM) i5 CPU M 520 @ 2.40GHz with 2 GB of RAM:
time psql -c 'update part_main set txid = txid + 1' vv
UPDATE 31000000
real 24m39.594s
user 0m0.121s
sys 0m0.036s
--
Bien à vous, Vincent Veyron
https://marica.fr
Gestion des contentieux juridiques, des contrats et des sinistres d'assurance
[ redirecting to pgsql-bugs ]
Klaudie Willis <Klaudie.Willis@protonmail.com> writes:
Turns out the base case is simpler than I thought. Not involving partitions at all
CREATE TABLE public.part_main (
txid bigint,
actiondate timestamp without time zone NOT NULL
);
insert into part_main
select x, '2019-06-01'::timestamp + x%365 * interval '1 day'
from generate_series(1, 30 * 1E6) as x;
CREATE INDEX partindx ON public.part_main USING btree ((actiondate)::date); -- mem bug?
-- mem runaway follows
update part_main set txid = txid + 1;
ITYM "((actiondate::date))", but yeah, this leaks memory like there's
no tomorrow. I traced it to 9dc718bdf (Pass down "logically unchanged
index" hint), which has added a function index_unchanged_by_update()
that (a) looks fairly expensive, (b) leaks a copy of every expression
tree it examines, and (c) is invoked over again for each row, even
though AFAICS the answer shouldn't change across rows. This seems very
poorly thought through. Peter?
regards, tom lane
PS: personally I would have used pull_varnos() instead of reinventing
that wheel. But in any case the real problem is repeated invocation.
Sorry about that;
I'll repost it here, corrected, for others who want to exhaust their memory:
--PG-14.1
CREATE TABLE public.part_main (
txid bigint,
actiondate timestamp without time zone NOT NULL
);
insert into part_main
select x, '2019-06-01'::timestamp + x%365 * interval '1 day'
from generate_series(1, 30 * 1E6) as x;
CREATE INDEX partindx ON public.part_main USING btree ((actiondate::date)); -- mem bug?
-- CREATE INDEX partindx ON public.part_main USING btree (actiondate); -- no bug
-- mem runaway follows
update part_main set txid = txid + 1;
Klaudie
Klaudie Willis <Klaudie.Willis@protonmail.com> writes:
I'll repost it here, corrected, for others who want to exhaust their memory:
--PG-14.1
This leak is new in v14, possibly that's why Vincent didn't reproduce it.
regards, tom lane
On Tue, 14 Dec 2021 11:18:07 -0500
Tom Lane <tgl@sss.pgh.pa.us> wrote:
This leak is new in v14, possibly that's why Vincent didn't reproduce it.
Indeed, I'm on v11
--
Bien à vous, Vincent Veyron
https://marica.fr
Gestion des contentieux juridiques, des contrats et des sinistres d'assurance
On Tue, Dec 14, 2021 at 7:58 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
ITYM "((actiondate::date))", but yeah, this leaks memory like there's
no tomorrow. I traced it to 9dc718bdf (Pass down "logically unchanged
index" hint), which has added a function index_unchanged_by_update()
that (a) looks fairly expensive, (b) leaks a copy of every expression
tree it examines, and (c) is invoked over again for each row, even
though AFAICS the answer shouldn't change across rows. This seems very
poorly thought through. Peter?
Ugh, what a howler. Clearly I am at fault here. Apologies.
Are you sure that it would really be worth the trouble of caching our
answer? It's not clear that it would have only a minimal maintenance burden.
I have always suspected that index_unchanged_by_update() was at least
slightly over-engineered.
The fact is that most individual aminsert() calls that get the hint
will never actually apply it in any way. In practice the hint is only
relevant when there isn't enough space on an nbtree leaf page to fit
the incoming item. Even then, it won't be used when there are LP_DEAD
bits set on the leaf page -- we prefer to perform a conventional index
deletion over a bottom-up index deletion. And so there is a fair
practical argument to be made in favor of assuming that we should give
the hint in cases where we can't rule it out inexpensively. Of course
that assumes that there will be no other use for the hint in the
future. I'm not making this argument myself, but it does seem like a
factor worth considering.
--
Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes:
Are you sure that it would really be worth the trouble of caching our
answer? It's not clear that that has only minimal maintenance burden.
I'd be inclined to do so if we can find a suitable place to put it.
But wouldn't a field in IndexInfo serve? Letting the field default
to "not optimizable" would cover most cases.
The fact is that most individual aminsert() calls that get the hint
will never actually apply it in any way.
Yeah, you could make an argument that just not trying to optimize when
there are index expressions would be fine for this --- and we may have
to fix it that way in v14, because I'm not sure whether adding a field
in IndexInfo would be safe ABI-wise. But ISTM that the overhead of
index_unchanged_by_update is a bit more than I care to pay per row
even when it's only considering plain index columns. I'm generally
allergic to useless per-row computations, especially when they're
being added by an alleged performance improvement.
Another thing we ought to check into is the extent to which this
is duplicative of the setup calculations for HOT updates --- I seem
to recall that there's already roughly-similar logic somewhere else.
And, not to be too picky, but does this cope with the case where
an indexed column is changed by a BEFORE trigger, not by the
query proper?
regards, tom lane
On Tue, Dec 14, 2021 at 11:33 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I'd be inclined to do so if we can find a suitable place to put it.
But wouldn't a field in IndexInfo serve? Letting the field default
to "not optimizable" would cover most cases.
I'll come up with a patch for that soon.
Yeah, you could make an argument that just not trying to optimize when
there are index expressions would be fine for this --- and we may have
to fix it that way in v14, because I'm not sure whether adding a field
in IndexInfo would be safe ABI-wise. But ISTM that the overhead of
index_unchanged_by_update is a bit more than I care to pay per row
even when it's only considering plain index columns. I'm generally
allergic to useless per-row computations, especially when they're
being added by an alleged performance improvement.
I am tempted to broach the idea of always giving the hint in the case
of a non-HOT update, actually. But that's probably too weird to
countenance when you take a broader, API-level view of things. (So
I'll skip the explanation of why I think that might be reasonable from
the point of view of the nbtree code.)
Another thing we ought to check into is the extent to which this
is duplicative of the setup calculations for HOT updates --- I seem
to recall that there's already roughly-similar logic somewhere else.
That's handled fairly directly, on the heapam side. At the top of
heap_update(), with some relcache infrastructure. Unlike
heap_update(), index_unchanged_by_update() cares about which specific
indexes have "logically modified" attributes. We already know for sure
that the update can't have been a HOT UPDATE when
index_unchanged_by_update() is reached, of course.
And, not to be too picky, but does this cope with the case where
an indexed column is changed by a BEFORE trigger, not by the
query proper?
No. It's much better to err in the direction of giving the hint,
rather than not giving the hint. In order for us to make the category
of error that seems like it might actually be a problem (not giving
the hint when we should), the BEFORE trigger would have to "undo" an
explicit change to an updated column.
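To make the edge case concrete, here is a hypothetical sketch (table, column, and function names are assumptions) of a BEFORE trigger that "undoes" the statement's change to an indexed column, so the stored value never actually changes even though the query proper modified the column:

```sql
-- Hypothetical illustration only: a BEFORE trigger that reverts the
-- UPDATE's change to x before the new row version is stored.
CREATE FUNCTION keep_old_x() RETURNS trigger AS $$
BEGIN
    NEW.x := OLD.x;  -- undo the change made by the UPDATE statement
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER revert_x
    BEFORE UPDATE ON alpha
    FOR EACH ROW EXECUTE FUNCTION keep_old_x();
```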
We also want to give the hint when a partial index is subject to lots
of non-HOT updates, when successive updates make the predicate flip
between matching and not matching. That was shown to be particularly
valuable (with a workload that has such an index). So the fact that we
don't handle predicates is intentional, even though the justification
for that relies on an implementation deficiency in HOT, that might be
fixed some day.
--
Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes:
On Tue, Dec 14, 2021 at 11:33 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
And, not to be too picky, but does this cope with the case where
an indexed column is changed by a BEFORE trigger, not by the
query proper?
No. It's much better to err in the direction of giving the hint,
rather than not giving the hint. In order for us to make the category
of error that seems like it might actually be a problem (not giving
the hint when we should), the BEFORE trigger would have to "undo" an
explicit change to an updated column.
Uh ... it seems that you are writing as though "giving the hint"
means saying that the column value changed. That seems quite
confusingly backwards to me, as that is/ought to be the expected
assumption. Maybe you should invert the flag state while you
are at it.
regards, tom lane
On Tue, Dec 14, 2021 at 3:28 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Uh ... it seems that you are writing as though "giving the hint"
means saying that the column value changed. That seems quite
confusingly backwards to me, as that is/ought to be the expected
assumption.
That's not what I meant. When I say "give the hint", I mean pass
"indexUnchanged = true" to aminsert(). This is interpreted within
btinsert() as "the incoming index tuple is a duplicate of at least one
existing index tuple, so perform a bottom-up index deletion pass if and when the
alternative is splitting the leaf page". In practice the hint always
has to be treated as a noisy signal about what might work, as a
strategy of last resort, with costs that are imposed on non-HOT
updaters.
--
Peter Geoghegan
On Tue, Dec 14, 2021 at 3:17 PM Peter Geoghegan <pg@bowt.ie> wrote:
I'll come up with a patch for that soon.
It seems that I've run out of time to do this before traveling to see
family over the holidays. I'll return to this early in the new year,
still well in time to get a fix into 14.2.
Thanks
--
Peter Geoghegan
On Tue, Dec 14, 2021 at 11:33 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I'd be inclined to do so if we can find a suitable place to put it.
But wouldn't a field in IndexInfo serve? Letting the field default
to "not optimizable" would cover most cases.
Attached draft HEAD-only bugfix adds two new bool fields to IndexInfo.
The first bool indicates if we've already done the required work for
this IndexInfo. The second field is used as a cache (if the cache is
set the first bool is 'true'). These two fields fit in existing
alignment padding, so the marginal space overhead is zero.
I'll probably need to greatly simplify the code for backpatch, to
avoid an ABI break. Seems fine to teach index_unchanged_by_update to
return "true" unconditionally, given how the IndexUnchanged hint is
currently applied.
I haven't made the code use pull_varnos(), which you suggested back in
December. It looks like it would be tricky to do that from the
executor, since pull_varnos() has a PlannerInfo* argument. That has
been the case since your commit 55dc86eca7 from January 2021, "Fix
pull_varnos' miscomputation of relids set for a PlaceHolderVar".
Please advise.
--
Peter Geoghegan
Attachments:
v1-0001-Fix-memory-leak-in-indexUnchanged-hint-mechanism.patch (+31, -3)
Peter Geoghegan <pg@bowt.ie> writes:
I haven't made the code use pull_varnos(), which you suggested back in
December. It looks like it would be tricky to do that from the
executor, since pull_varnos() has a PlannerInfo* argument. That has
been the case since your commit 55dc86eca7 from January 2021, "Fix
pull_varnos' miscomputation of relids set for a PlaceHolderVar".
Please advise.
Pass NULL for that, per 6867f963e:
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=6867f963e#patch2
We'd have to back-patch that bit, but I don't see any problem
with doing so.
regards, tom lane
On Tue, Jan 11, 2022 at 11:54 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Pass NULL for that, per 6867f963e:
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=6867f963e#patch2
Will look into that.
We'd have to back-patch that bit, but I don't see any problem
with doing so.
But the back-patch fix will make index_unchanged_by_update return true
unconditionally (to avoid an ABI break). It won't actually do anything
with Vars, which, as I said, seems okay given the current way in which
we apply the hint.
--
Peter Geoghegan