Counting the occurences of a substring within a very large text
Hello,
I'd like to count the number linebreaks within a string,
but I get a memory allocation error when using regexp_matches or regexp_split_to_table.
Any idea for an alternative to this problem ?
select count(*)-1 from
( select regexp_split_to_table(full_message,'(\n)', 'g')
from mytable
where id =-2146999703
)foo;
ERROR: invalid memory alloc request size 1447215584
regards,
Marc Mamin
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Hello,
I'd like to count the number linebreaks within a string,
but I get a memory allocation error when using regexp_matches or regexp_split_to_table.Any idea for an alternative to this problem ?
select count(*)-1 from
( select regexp_split_to_table(full_message,'(\n)', 'g')
from mytable
where id =-2146999703
)foo;ERROR: invalid memory alloc request size 1447215584
regards,
Marc Mamin
Hi,
what's the size of full_message from mytable where id =-2146999703?
Also: Postgres version? OS? any extensions installed?
Bye,
Chris.
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Marc Mamin wrote:
I'd like to count the number linebreaks within a string,
but I get a memory allocation error when using regexp_matches or regexp_split_to_table.Any idea for an alternative to this problem ?
select count(*)-1 from
( select regexp_split_to_table(full_message,'(\n)', 'g')
from mytable
where id =-2146999703
)foo;ERROR: invalid memory alloc request size 1447215584
Does any of these two work:
SELECT length(regexp_replace(full_message, '[^\n]', '', 'g'))
FROM mytable
WHERE id = -2146999703;
or
SELECT length(full_message) - length(replace(full_message, E'\n', ''))
FROM mytable
WHERE id = -2146999703;
Yours,
Laurenz Albe
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On 6/24/2015 5:55 AM, Marc Mamin wrote:
Hello,
I'd like to count the number linebreaks within a string,
but I get a memory allocation error when using regexp_matches or regexp_split_to_table.Any idea for an alternative to this problem ?
select count(*)-1 from
( select regexp_split_to_table(full_message,'(\n)', 'g')
from mytable
where id =-2146999703
)foo;ERROR: invalid memory alloc request size 1447215584
regards,
Marc Mamin
If its a large enough string, switching to plperl/plpython might give
you a pleasant speedup. There is a small overhead getting the string
to/from pg, but the string ops will be much faster.
-Andy
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
-----Original Message-----
From: Albe Laurenz [mailto:laurenz.albe@wien.gv.at]
Sent: Mittwoch, 24. Juni 2015 13:44
To: Marc Mamin; Postgres General
Subject: RE: Counting the occurences of a substring within a very large
textMarc Mamin wrote:
I'd like to count the number linebreaks within a string, but I get a
memory allocation error when using regexp_matches orregexp_split_to_table.
Any idea for an alternative to this problem ?
select count(*)-1 from
( select regexp_split_to_table(full_message,'(\n)', 'g')
from mytable
where id =-2146999703
)foo;ERROR: invalid memory alloc request size 1447215584
Does any of these two work:
SELECT length(regexp_replace(full_message, '[^\n]', '', 'g')) FROM
mytable WHERE id = -2146999703;or
SELECT length(full_message) - length(replace(full_message, E'\n', ''))
FROM mytable WHERE id = -2146999703;
no, they both yeld the same error.
a new string functions for this would be nice, as it could certainly be implemented in a more efficient way...
BTW: the text to check is a single 350 MB error message from a log file :)
Yours,
Laurenz Albe
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
-----Original Message-----
From: Chris Mair [mailto:chris@1006.org]
Sent: Mittwoch, 24. Juni 2015 13:26
To: Marc Mamin; Postgres General
Subject: Re: [GENERAL] Counting the occurences of a substring within a
very large textHello,
I'd like to count the number linebreaks within a string, but I get a
memory allocation error when using regexp_matches orregexp_split_to_table.
Any idea for an alternative to this problem ?
select count(*)-1 from
( select regexp_split_to_table(full_message,'(\n)', 'g')
from mytable
where id =-2146999703
)foo;ERROR: invalid memory alloc request size 1447215584
regards,
Marc Mamin
Hi,
what's the size of full_message from mytable where id =-2146999703?
345 MB, Postgres 9.3.6
Marc Mamin
Also: Postgres version? OS? any extensions installed?
Bye,
Chris.
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
I'd like to count the number linebreaks within a string, but I get
a memory allocation error when using regexp_matches orregexp_split_to_table.
Any idea for an alternative to this problem ?
select count(*)-1 from
( select regexp_split_to_table(full_message,'(\n)', 'g')
from mytable
where id =-2146999703
)foo;ERROR: invalid memory alloc request size 1447215584
Does any of these two work:
SELECT length(regexp_replace(full_message, '[^\n]', '', 'g')) FROM
mytable WHERE id = -2146999703;or
SELECT length(full_message) - length(replace(full_message, E'\n',
''))
FROM mytable WHERE id = -2146999703;
no, they both yeld the same error.
and this fails too, which is more annoying as it looks like a bug:
SELECT replace(full_message, E'\n', '') FROM stadium_rprod.aserrorfull_20150623 WHERE id = -2146999703;
note that the 345MB text only contains 635 lines. This might be the issue...
Marc Mamin
a new string functions for this would be nice, as it could certainly be
implemented in a more efficient way...BTW: the text to check is a single 350 MB error message from a
log file :)Yours,
Laurenz Albe
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On 24 June 2015 at 14:51, Marc Mamin <M.Mamin@intershop.de> wrote:
note that the 345MB text only contains 635 lines. This might be the
issue...
There's similar issue discussed here:
/messages/by-id/6046.1353874252@sss.pgh.pa.us
Tom did seem to accept that the attempted alloc is larger than it needs to
be, but even if it were fixed you would still run into problems further
down the line managing values of that size.
Geoff
On 24 June 2015 at 14:51, Marc Mamin <M.Mamin@intershop.de> wrote:
note that the 345MB text only contains 635 lines. This might be the issue...
There's similar issue discussed here:
Tom did seem to accept that the attempted alloc is larger than it needs to be, but even if it were fixed you would still run into problems further down the line managing values of that size.
Thanks,
does it means that the issue would also occur in a (not yet existing) function that would just count the substring, without having to manipulate the text itself ?
or in other words, are there string functions that can handle such large text in stream, or is it always necessary to alloc enough memory for the whole text first ?
Marc
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Marc Mamin <M.Mamin@intershop.de> writes:
I'd like to count the number linebreaks within a string,
but I get a memory allocation error when using regexp_matches or regexp_split_to_table.
Anything involving a regexp is going to have a maximum input string length
of about 256MB, as a result of conversion to pg_wchar format.
regexp_split_to_table(), for instance, does this:
/* convert string to pg_wchar form for matching */
orig_len = VARSIZE_ANY_EXHDR(orig_str);
wide_str = (pg_wchar *) palloc(sizeof(pg_wchar) * (orig_len + 1));
wide_len = pg_mb2wchar_with_len(VARDATA_ANY(orig_str), wide_str, orig_len);
palloc() would complain for requests beyond 1GB, and sizeof(pg_wchar) is
4, so 256MB is the longest orig_str this can handle. (This coding is on
the hairy edge of being a security bug on 32-bit machines, too, but AFAICS
it is okay because there's a factor of 4 daylight between the max possible
input length of 1GB and overflowing uint32.)
In newer branches we could consider using MemoryContextAllocHuge to escape
the 1GB limit on wide_str, but it would take some research to be sure that
everything involved is using suitable datatypes for string indices and so
forth. I'd be a bit worried about the performance implications of
throwing such large strings around, anyway.
regards, tom lane
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Marc Mamin wrote:
I'd like to count the number linebreaks within a string, but I get a
memory allocation error when using regexp_matches or regexp_split_to_table.
Does any of these two work:
[...]
no, they both yeld the same error.
a new string functions for this would be nice, as it could certainly be implemented in a more efficient way...
This is a rather special case.
But it should be easy to write your own C function that does this efficiently.
Yours,
Laurenz Albe
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general