Stack overflow issue
Hello, I recently got a server crash (bug #17583 [1]) caused by a stack overflow.
Tom Lane and Richard Guo, in a discussion of this bug, suggested that there could be more such places.
Therefore, Alexander Lakhin and I decided to deal with this issue, and Alexander developed a methodology. We processed src/backend/*/*.c with "clang -emit-llvm ... | opt -analyze -print-callgraph" to find all the functions that call themselves directly. I then checked each of them for features that protect against stack overflows.
We analyzed four directories: regex, tsearch, snowball and adt.
Firstly, we decided to test the regex directory functions and found 6 of them that lack a check_stack_depth() call:
zaptreesubs
markst
next
nfatree
numst
repeat
We have tried to exploit the recursion in the function zaptreesubs():
select regexp_matches('a' || repeat(' a', 11000), '(.)(' || repeat(' \1', 11000) || ')?');
ERROR: invalid regular expression: regular expression is too complex
repeat():
select regexp_match('abc01234xyz',repeat('a{0,2}',100001));
ERROR: invalid regular expression: regular expression is too complex
numst():
select regexp_match('abc01234xyz',repeat('(.)\1e',100001));
ERROR: invalid regular expression: regular expression is too complex
markst():
markst is called in the code after v->tree = parse(...);
so the tree must already have been parsed successfully. With a nesting level of about 100,000 that cannot happen: the stack protection triggers during parsing. The same applies to v->ntree = numst(...);, which is called there as well.
next():
we were able to crash the server with the following query:
(printf "SELECT regexp_match('abc', 'a"; for ((i=1;i<1000000;i++)); do printf "(?#)"; done; printf "b')" ) | psql
Secondly, we tried to exploit the recursion in the adt directory functions, and Alexander was able to crash the server with the following query:
regex_selectivity_sub():
SELECT * FROM pg_proc WHERE proname ~ ('(a' || repeat('|', 200000) || 'b)');
And this query:
(n=100000;
printf "SELECT polygon '((0,0),(0,1000000))' <@ polygon '((-200000,1000000),";
for ((i=1;i<$n;i++)); do printf "(100000,$(( 300000 + $i))),(-100000,$((800000 + $i))),"; done;
printf "(200000,900000),(200000,0))';"
) | psql
Thirdly, the snowball directory: Alexander tried to exploit the recursion in the r_stem_suffix_chain_before_ki function and crashed a server using this query:
r_stem_suffix_chain_before_ki():
SELECT ts_lexize('turkish_stem', repeat('lerdeki', 1000000));
The last one is the tsearch directory. We have found 4 functions that lack a check_stack_depth() call:
SplitToVariants
mkANode
mkSPNode
LexizeExec
We have tried to exploit the recursion in the SplitToVariants function and Alexander crashed a server using this:
SplitToVariants():
CREATE TEXT SEARCH DICTIONARY ispell (Template=ispell, DictFile=ispell_sample,AffFile=ispell_sample);
SELECT ts_lexize('ispell', repeat('bally', 10000));
After trying to exploit the recursion in the LexizeExec function, Alexander came to this conclusion:
LexizeExec has two branches: "ld->curDictId == InvalidOid" (usual mode) and "ld->curDictId != InvalidOid" (multiword mode). We start in the first one, then make a recursive call to switch to multiword mode, but then we return to the usual mode again.
mkANode and mkSPNode deal with the dictionary structs, not with user-supplied data, so we believe these functions are not vulnerable.
[1]: /messages/by-id/CAMbWs499ytQiH4mLMhRxRWP-iEUz3-DSinpAD-cUCtVo_23Wtg@mail.gmail.com
Hi,
Can we have a parameter to control the recursion depth in these cases to
avoid crashes?
Just a thought.
Thanks,
Mahendrakar.
On Wed, 24 Aug 2022, 3:21 pm, Egor Chindyaskin <kyzevan23@mail.ru> wrote:
On 2022-Aug-24, mahendrakar s wrote:
Hi,
Can we have a parameter to control the recursion depth in these cases to
avoid crashes?
We already have one (max_stack_depth). The problem is lack of calling
the control function in a few places.
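For readers following along, the pattern in question can be sketched outside the backend. In PostgreSQL the guard is check_stack_depth() (declared in miscadmin.h), which raises an ERROR once the stack approaches max_stack_depth; the standalone toy below substitutes a depth counter for that check, and the parenthesis-nesting example and MAX_DEPTH limit are purely illustrative:

```c
/* Toy illustration of the guard pattern: a recursive routine that
 * bails out before the recursion can overflow the stack.  In the
 * backend this test would be check_stack_depth(), which compares the
 * current stack pointer against max_stack_depth and raises ERROR. */
#define MAX_DEPTH 10000

/* Returns the depth of leading '(' characters, or -1 where the
 * backend would report "stack depth limit exceeded" instead of
 * crashing. */
static int nesting_depth(const char *s, int depth)
{
    if (depth > MAX_DEPTH)      /* the check the listed functions lack */
        return -1;
    if (*s != '(')
        return 0;
    int inner = nesting_depth(s + 1, depth + 1);
    return inner < 0 ? -1 : inner + 1;
}
```

With the guard in place, pathological input takes a clean error path (-1 here) instead of segfaulting the process.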
--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
Thanks.
On Wed, 24 Aug, 2022, 4:19 pm Alvaro Herrera, <alvherre@alvh.no-ip.org>
wrote:
On Wed, Aug 24, 2022 at 6:49 PM Alvaro Herrera <alvherre@alvh.no-ip.org>
wrote:
On 2022-Aug-24, mahendrakar s wrote:
Hi,
Can we have a parameter to control the recursion depth in these cases to
avoid crashes?

We already have one (max_stack_depth). The problem is lack of calling
the control function in a few places.
Thanks Egor and Alexander for the work! I think we can just add
check_stack_depth checks in these cases.
Thanks
Richard
Attached adds the checks in these places. But I'm not sure about the
snowball case. Can we edit src/backend/snowball/libstemmer/*.c directly?
Thanks
Richard
Attachments:
v1-0001-add-check_stack_depth-in-more-places.patch (+17 -1)
Hi Richard,
Patch is looking good to me. Would request others to take a look at it as
well.
Thanks,
Mahendrakar.
On Wed, 24 Aug 2022 at 17:24, Richard Guo <guofenglinux@gmail.com> wrote:
Egor Chindyaskin <kyzevan23@mail.ru> writes:
Therefore, Alexander Lakhin and I decided to deal with this issue and Alexander developed a methodology. We processed src/backend/*/*.c with "clang -emit-llvm ... | opt -analyze -print-calgraph" to find all the functions that call themselves directly. I checked each of them for features that protect against stack overflows.
We analyzed 4 catalogs: regex, tsearch, snowball and adt.
Firstly, we decided to test the regex catalog functions and found 6 of them that lack the check_stack_depth() call.
Nice work! I wonder if you can make the regex crashes reachable by
reducing the value of max_stack_depth enough that it's hit before
reaching the "regular expression is too complex" limit.
regards, tom lane
Richard Guo <guofenglinux@gmail.com> writes:
Attached adds the checks in these places. But I'm not sure about the
snowball case. Can we edit src/backend/snowball/libstemmer/*.c directly?
No, that file is generated code, as it says right at the top.
I think most likely we should report this to Snowball upstream
and see what they think is an appropriate fix.
regards, tom lane
Egor Chindyaskin <kyzevan23@mail.ru> writes:
Firstly, we decided to test the regex catalog functions and found 6 of them that lack the check_stack_depth() call.
zaptreesubs
markst
next
nfatree
numst
repeat
I took a closer look at these. I think the markst, numst, and nfatree
cases are non-issues. They are recursing on a subre tree that was just
built by parse(), so parse() must have successfully recursed the same
number of levels. parse() surely has a larger stack frame, and it
does have a stack overflow guard (in subre()), so it would have failed
cleanly before making a data structure that could be hazardous here.
Also, having markst error out would be problematic for the reasons
discussed in its comment, so I'm disinclined to try to add checks
that have no use.
BTW, I wonder why your test didn't notice freesubre()? But the
same analysis applies there, as does the concern that we can't
just error out.
Likewise, zaptreesubs() can't recurse more levels than cdissect() did,
and that has a stack check, so I'm not very excited about adding
another one there.
I believe that repeat() is a non-issue because (a) the number of
recursion levels in it is limited by DUPMAX, which is generally going
to be 255, or at least not enormous, and (b) it will recurse at most
once before calling dupnfa(), which contains stack checks.
I think next() is a legit issue, although your example doesn't crash
for me. I suppose that's because my compiler turned the tail recursion
into a loop, and I suggest that we fix it by doing that manually.
(Richard's proposed fix is wrong anyway: we can't just throw elog(ERROR)
in the regex code without creating memory leaks.)
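To make the suggestion concrete, here is a sketch of such a manual tail-call elimination on a hypothetical scanner (this is not the regex engine's actual next(); the token and names are stand-ins):

```c
#include <string.h>

/* Tail-recursive form: each skipped token costs a stack frame unless
 * the compiler happens to turn the tail call into a jump, so very
 * long inputs can overflow the stack. */
static const char *skip_marks_rec(const char *s)
{
    if (*s == '#')              /* stand-in for a zero-width token */
        return skip_marks_rec(s + 1);
    return s;
}

/* Hand-converted loop: behaviorally identical, but stack usage is
 * constant no matter how many tokens the input contains. */
static const char *skip_marks_loop(const char *s)
{
    while (*s == '#')
        s++;
    return s;
}
```

The loop form removes the dependence on the compiler's tail-call optimization, which is why doing the conversion manually is safer than relying on -O flags.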
regards, tom lane
I wrote:
I think most likely we should report this to Snowball upstream
and see what they think is an appropriate fix.
Done at [1], and I pushed the other fixes. Thanks again for the report!
regards, tom lane
[1]: https://lists.tartarus.org/pipermail/snowball-discuss/2022-August/001734.html
I wrote:
I think most likely we should report this to Snowball upstream
and see what they think is an appropriate fix.
Done at [1], and I pushed the other fixes. Thanks again for the report!
The upstream recommendation, which seems pretty sane to me, is to
simply reject any string exceeding some threshold length as not
possibly being a word. Apparently it's common to use thresholds
as small as 64 bytes, but in the attached I used 1000 bytes.
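A minimal sketch of that threshold idea (the 1000-byte cutoff is the figure from the patch description; the function name is an assumption of this sketch, not the patch's API):

```c
#include <string.h>

/* Any string longer than the cutoff is assumed not to be a word and
 * is never handed to the stemmer, which bounds the Turkish stemmer's
 * suffix-chain recursion by the word length. */
#define MAX_STEM_LEN 1000

static int word_is_stemmable(const char *word)
{
    return strlen(word) <= (size_t) MAX_STEM_LEN;
}
```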
regards, tom lane
Attachments:
limit-length-of-strings-passed-to-snowball.patch (+15 -1)
I wrote:
The upstream recommendation, which seems pretty sane to me, is to
simply reject any string exceeding some threshold length as not
possibly being a word. Apparently it's common to use thresholds
as small as 64 bytes, but in the attached I used 1000 bytes.
On further thought: that coding treats anything longer than 1000
bytes as a stopword, but maybe we should just accept it unmodified.
The manual says "A Snowball dictionary recognizes everything, whether
or not it is able to simplify the word". While "recognizes" formally
includes the case of "recognizes as a stopword", people might find
this behavior surprising. We could alternatively do it as attached,
which accepts overlength words but does nothing to them except
case-fold. This is closer to the pre-patch behavior, but gives up
the opportunity to avoid useless downstream processing of long words.
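The two behaviors can be contrasted in a toy sketch (THRESHOLD and both function names are illustrative, not the patch's actual interface): the first patch maps an overlength word to "no lexeme" (a stopword), the second maps it to itself after case-folding only.

```c
#include <ctype.h>
#include <string.h>

#define THRESHOLD 1000

/* Variant 1: an overlength word is treated as a stopword. */
static const char *lexize_v1(char *word)
{
    if (strlen(word) > THRESHOLD)
        return NULL;            /* recognized, but emits no lexeme */
    return word;                /* (real code would stem here) */
}

/* Variant 2: an overlength word is accepted unmodified except for
 * case-folding, skipping the stemmer entirely. */
static char *lexize_v2(char *word)
{
    if (strlen(word) > THRESHOLD)
    {
        for (char *p = word; *p; p++)
            *p = (char) tolower((unsigned char) *p);
        return word;            /* pass through, folded only */
    }
    return word;                /* (real code would stem here) */
}
```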
regards, tom lane
Attachments:
limit-length-of-strings-passed-to-snowball-2.patch (+17 -1)
On Wed, Aug 31, 2022 at 6:57 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
This patch looks good to me. It avoids overly-long words (> 1000 bytes)
going through the stemmer so the stack overflow issue in Turkish stemmer
should not exist any more.
Thanks
Richard
24.08.2022 20:58, Tom Lane writes:
Nice work! I wonder if you can make the regex crashes reachable by
reducing the value of max_stack_depth enough that it's hit before
reaching the "regular expression is too complex" limit.
Hello everyone! It's been a while since Alexander Lakhin and I
published a list of functions that suffer from stack overflows. We are
back to tell you more about such places.
During our analysis we concluded that some functions can be crashed
without changing any parameters, while others can be crashed only
under modified settings.
The first function crashes without any changes:
# CheckAttributeType
(n=60000; printf "create domain dint as int; create domain dinta0 as dint[];"; for ((i=1;i<=$n;i++)); do printf "create domain dinta$i as dinta$(( $i - 1 ))[]; "; done; ) | psql
psql -c "create table t(f1 dinta60000[]);"
Some of the others crash if we change "max_locks_per_transaction"
parameter:
# findDependentObjects
max_locks_per_transaction = 200
(n=10000; printf "create table t (i int); create view v0 as select * from t;"; for ((i=1;i<$n;i++)); do printf "create view v$i as select * from v$(( $i - 1 )); "; done; ) | psql
psql -c "drop table t"
# ATExecDropColumn
max_locks_per_transaction = 300
(n=50000; printf "create table t0 (a int, b int); "; for ((i=1;i<=$n;i++)); do printf "create table t$i() inherits(t$(( $i - 1 ))); "; done; printf "alter table t0 drop b;" ) | psql
# ATExecDropConstraint
max_locks_per_transaction = 300
(n=50000; printf "create table t0 (a int, b int, constraint bc check (b > 0));"; for ((i=1;i<=$n;i++)); do printf "create table t$i() inherits(t$(( $i - 1 ))); "; done; printf "alter table t0 drop constraint bc;" ) | psql
# ATExecAddColumn
max_locks_per_transaction = 200
(n=50000; printf "create table t0 (a int, b int);"; for ((i=1;i<=$n;i++)); do printf "create table t$i() inherits(t$(( $i - 1 ))); "; done; printf "alter table t0 add column c int;" ) | psql
# ATExecAlterConstrRecurse
max_locks_per_transaction = 300
(n=50000;
printf "create table t(a int primary key); create table pt (a int primary key, foreign key(a) references t) partition by range (a);";
printf "create table pt0 partition of pt for values from (0) to (100000) partition by range (a);";
for ((i=1;i<=$n;i++)); do printf "create table pt$i partition of pt$(( $i - 1 )) for values from ($i) to (100000) partition by range (a); "; done;
printf "alter table pt alter constraint pt_a_fkey deferrable initially deferred;"
) | psql
This is where the fun begins. According to Tom Lane, a decrease in
max_stack_depth could lead to new crashes, but it turned out that
Alexander was able to find new crashes precisely by increasing this
parameter. Also, we had ulimit -s set to the default value of 8MB.
# eval_const_expressions_mutator
max_stack_depth = '7000kB'
(n=10000; printf "select 'a' "; for ((i=1;i<$n;i++)); do printf " collate \"C\" "; done; ) | psql
If, like me, you didn't get a crash when Alexander shared this find,
your cluster was probably built with the -Og optimization flag. While
trying to break this function, we concluded that the effective
recursion limit depends on the optimization flag (-O0/-Og): with
optimization, the function's stack frame becomes smaller, so the limit
is reached more slowly and the system can withstand more. This query
should therefore fail on a cluster built with the -O0 optimization
flag.
The crash of the next function depends not only on the optimization
flag but also on a number of other things. While researching, we
noticed that postgres enforces a gap of about 400kB between
max_stack_depth and ulimit -s. We thought we could hit the
max_stack_depth limit and then hit the OS limit as well, so Alexander
wrote a recursive SQL function that eats up the stack within
max_stack_depth and includes a query that eats up the remaining
~400kB. This causes a crash.
# executeBoolItem
max_stack_depth = '7600kB'
create function infinite_recurse(i int) returns int as $$
begin
  raise notice 'Level %', i;
  begin
    perform jsonb_path_query('{"a":[1]}'::jsonb,
      ('$.a[*] ? (' || repeat('!(', 4800) || '@ == @' || repeat(')', 4800) || ')')::jsonpath);
  exception
    when others then raise notice 'jsonb_path_query error at level %, %', i, sqlerrm;
  end;
  begin
    select infinite_recurse(i + 1) into i;
  exception
    when others then raise notice 'Max stack depth reached at level %, %', i, sqlerrm;
  end;
  return i;
end;
$$ language plpgsql;
select infinite_recurse(1);
To sum it all up, we have not yet decided on a general approach to
such functions. Some functions are definitely subject to stack
overflow, and some are definitely not; this can be seen in the code,
where either a recurse flag is passed or a stack-checking function is
called before a recursive call. Some require special conditions: for
example, you need to parse the query and build a plan, and at that
stage the stack is consumed (and checked) faster than by the function
we are interested in.
We keep researching and hope to come up with a good solution sooner or
later.
Wednesday, 26 October 2022, 21:47 +07:00, Egor Chindyaskin <kyzevan23@mail.ru> wrote:
Hello, in continuation of the stack overflow topic: Alexander Lakhin was able to find a few more similar places.
An important point for the first function is that the server must be built with asserts enabled, otherwise the crash does not happen.
Also note that the server crash occurs only after 2-3 hours.
#MemoryContextCheck
(n=1000000; printf "begin;"; for ((i=1;i<=$n;i++)); do printf "savepoint s$i;"; done; printf "release s1;" ) | psql >/dev/null
Other functions could be crashed without asserts enabled.
#CommitTransactionCommand
(n=1000000; printf "BEGIN;"; for ((i=1;i<=$n;i++)); do printf "SAVEPOINT s$i;"; done; printf "ERROR; COMMIT;") | psql >/dev/null
#MemoryContextStatsInternal
(n=1000000; printf "BEGIN;"; for ((i=1;i<=$n;i++)); do printf "SAVEPOINT s$i;"; done; printf "SELECT pg_log_backend_memory_contexts(pg_backend_pid())") | psql >/dev/null
#ShowTransactionStateRec
(n=1000000; printf "BEGIN;"; for ((i=1;i<=$n;i++)); do printf "SAVEPOINT s$i;"; done; printf "SET log_min_messages = 'DEBUG5'; SAVEPOINT sp;") | psql >/dev/null
The next two functions call each other; the following way was found to overflow the stack (with a modified server configuration):
#MemoryContextDeleteChildren with MemoryContextDelete
max_connections = 1000
max_stack_depth = '7600kB'
create table idxpart (a int) partition by range (a);
select 'create index on idxpart (a)' from generate_series(1, 40000);
\gexec
create function infinite_recurse(level int) returns int as $$
declare l int;
begin
begin
select infinite_recurse(level + 1) into level;
exception
when others then raise notice 'Max stack depth reached at level %, %', level, sqlerrm;
create table idxpart1 partition of idxpart for values from (1) to (2) partition by range (a);
end;
return level;
end;
$$ language plpgsql;
select infinite_recurse(1);
Finally, there are two more recursive functions in mcxt.c:
#MemoryContextResetChildren - could be vulnerable but not used at all after eaa5808e.
#MemoryContextMemAllocated - at present called only with local contexts.
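As an aside, the savepoint crashes above come from recursion whose depth grows with the length of the context chain. One general escape, sketched here on a toy struct whose fields merely mimic mcxt.c's child links (this is not the backend's actual fix), is to traverse with an explicit heap-allocated stack so call-stack usage stays bounded:

```c
#include <stdlib.h>

/* Toy stand-in for MemoryContextData's child links. */
typedef struct Ctx
{
    struct Ctx *firstchild;
    struct Ctx *nextchild;
} Ctx;

/* Visit every context in the tree without recursing: pop a node,
 * push its children onto a growable worklist.  Returns the number of
 * contexts visited, or -1 on allocation failure.  Heap usage is
 * proportional to tree width; call-stack usage is constant. */
static int count_contexts(Ctx *root)
{
    size_t cap = 16, top = 0;
    int count = 0;
    Ctx **stack = malloc(cap * sizeof(Ctx *));

    if (stack == NULL)
        return -1;
    stack[top++] = root;
    while (top > 0)
    {
        Ctx *c = stack[--top];

        count++;
        for (Ctx *child = c->firstchild; child; child = child->nextchild)
        {
            if (top == cap)
            {
                Ctx **bigger = realloc(stack, (cap *= 2) * sizeof(Ctx *));

                if (bigger == NULL)
                {
                    free(stack);
                    return -1;
                }
                stack = bigger;
            }
            stack[top++] = child;
        }
    }
    free(stack);
    return count;
}
```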
Great work. Is the maximum stack depth memory-dependent? Processor-dependent?
Egor Chindyaskin <kyzevan23@mail.ru> wrote on Wed, 24 Aug 2022, 11:51:
Hello! In continuation of the topic, under the guidance of Alexander Lakhin I have prepared patches that fix these problems.
We decided that these checks would be sufficient and put them in the places we saw fit.
Attachments:
stack_overflow_fix_15.patch (+44 -0)
stack_overflow_fix_master.patch (+44 -0)
stack_overflow_fix_11.patch (+40 -0)
stack_overflow_fix_12_13_14.patch (+44 -0)
03.01.2023 22:45, Sascha Kuhl writes:
Great work. Max Stack depth is memory dependent? Processor dependent?
Hello! These situations are not specific to the x86_64 architecture; they also manifest themselves, for example, on aarch64.
For instance, this query, run on aarch64:
(n=1000000; printf "begin;"; for ((i=1;i<=$n;i++)); do printf "savepoint s$i;"; done; printf "release s1;" ) | psql > /dev/null
crashed the server at savepoint s174617 with the following stack trace:
Core was generated by `postgres: test test [local]
SAVEPOINT '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 AllocSetCheck (context=<error reading variable: Cannot access memory
at address 0xffffe2397fe8>) at aset.c:1409
1409 {
(gdb) bt
#0 AllocSetCheck (context=<error reading variable: Cannot access memory
at address 0xffffe2397fe8>) at aset.c:1409
#1 0x0000aaaad78c38c4 in MemoryContextCheck (context=0xaaab39ee16a0) at
mcxt.c:740
#2 0x0000aaaad78c38dc in MemoryContextCheck (context=0xaaab39edf690) at
mcxt.c:742
#3 0x0000aaaad78c38dc in MemoryContextCheck (context=0xaaab39edd680) at
mcxt.c:742
#4 0x0000aaaad78c38dc in MemoryContextCheck (context=0xaaab39edb670) at
mcxt.c:742
#5 0x0000aaaad78c38dc in MemoryContextCheck (context=0xaaab39ed9660) at
mcxt.c:742
#6 0x0000aaaad78c38dc in MemoryContextCheck (context=0xaaab39ed7650) at
mcxt.c:742
#7 0x0000aaaad78c38dc in MemoryContextCheck (context=0xaaab39ed5640) at
mcxt.c:742
#8 0x0000aaaad78c38dc in MemoryContextCheck (context=0xaaab39ed3630) at
mcxt.c:742
#9 0x0000aaaad78c38dc in MemoryContextCheck (context=0xaaab39ed1620) at
mcxt.c:742
#10 0x0000aaaad78c38dc in MemoryContextCheck (context=0xaaab39ecf610) at
mcxt.c:742
...
#174617 0x0000aaaad78c38dc in MemoryContextCheck
(context=0xaaaae47994b0) at mcxt.c:742
#174618 0x0000aaaad78c38dc in MemoryContextCheck
(context=0xaaaae476dcd0) at mcxt.c:742
#174619 0x0000aaaad78c38dc in MemoryContextCheck
(context=0xaaaae46ead50) at mcxt.c:742
#174620 0x0000aaaad76c7e24 in finish_xact_command () at postgres.c:2739
#174621 0x0000aaaad76c55b8 in exec_simple_query
(query_string=0xaaaae46f0540 "savepoint s174617;") at postgres.c:1238
#174622 0x0000aaaad76ca7a4 in PostgresMain (argc=1, argv=0xffffe2b96898,
dbname=0xaaaae471c098 "test", username=0xaaaae471c078 "test") at
postgres.c:4508
#174623 0x0000aaaad75e263c in BackendRun (port=0xaaaae4711470) at
postmaster.c:4530
#174624 0x0000aaaad75e1f70 in BackendStartup (port=0xaaaae4711470) at
postmaster.c:4252
#174625 0x0000aaaad75dd4c0 in ServerLoop () at postmaster.c:1745
#174626 0x0000aaaad75dcd3c in PostmasterMain (argc=3,
argv=0xaaaae46eacb0) at postmaster.c:1417
#174627 0x0000aaaad74d462c in main (argc=3, argv=0xaaaae46eacb0) at
main.c:209
Hello! In continuation of the topic, I would like to suggest a solution.
This patch adds several checks to the vulnerable functions above.