Counting the occurences of a substring within a very large text

Started by Marc Maminalmost 11 years ago11 messagesgeneral
Jump to latest
#1Marc Mamin
M.Mamin@intershop.de

Hello,

I'd like to count the number linebreaks within a string,
but I get a memory allocation error when using regexp_matches or regexp_split_to_table.

Any idea for an alternative to this problem ?

select count(*)-1 from
( select regexp_split_to_table(full_message,'(\n)', 'g')
from mytable
where id =-2146999703
)foo;

ERROR: invalid memory alloc request size 1447215584

regards,

Marc Mamin

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#2Chris Mair
chris@1006.org
In reply to: Marc Mamin (#1)
Re: Counting the occurences of a substring within a very large text

Hello,

I'd like to count the number linebreaks within a string,
but I get a memory allocation error when using regexp_matches or regexp_split_to_table.

Any idea for an alternative to this problem ?

select count(*)-1 from
( select regexp_split_to_table(full_message,'(\n)', 'g')
from mytable
where id =-2146999703
)foo;

ERROR: invalid memory alloc request size 1447215584

regards,

Marc Mamin

Hi,

what's the size of full_message from mytable where id =-2146999703?

Also: Postgres version? OS? any extensions installed?

Bye,
Chris.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#3Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Marc Mamin (#1)
Re: Counting the occurences of a substring within a very large text

Marc Mamin wrote:

I'd like to count the number linebreaks within a string,
but I get a memory allocation error when using regexp_matches or regexp_split_to_table.

Any idea for an alternative to this problem ?

select count(*)-1 from
( select regexp_split_to_table(full_message,'(\n)', 'g')
from mytable
where id =-2146999703
)foo;

ERROR: invalid memory alloc request size 1447215584

Does any of these two work:

SELECT length(regexp_replace(full_message, '[^\n]', '', 'g'))
FROM mytable
WHERE id = -2146999703;

or

SELECT length(full_message) - length(replace(full_message, E'\n', ''))
FROM mytable
WHERE id = -2146999703;

Yours,
Laurenz Albe

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#4Andy Colson
andy@squeakycode.net
In reply to: Marc Mamin (#1)
Re: Counting the occurences of a substring within a very large text

On 6/24/2015 5:55 AM, Marc Mamin wrote:

Hello,

I'd like to count the number linebreaks within a string,
but I get a memory allocation error when using regexp_matches or regexp_split_to_table.

Any idea for an alternative to this problem ?

select count(*)-1 from
( select regexp_split_to_table(full_message,'(\n)', 'g')
from mytable
where id =-2146999703
)foo;

ERROR: invalid memory alloc request size 1447215584

regards,

Marc Mamin

If its a large enough string, switching to plperl/plpython might give
you a pleasant speedup. There is a small overhead getting the string
to/from pg, but the string ops will be much faster.

-Andy

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#5Marc Mamin
M.Mamin@intershop.de
In reply to: Laurenz Albe (#3)
Re: Counting the occurences of a substring within a very large text

-----Original Message-----
From: Albe Laurenz [mailto:laurenz.albe@wien.gv.at]
Sent: Mittwoch, 24. Juni 2015 13:44
To: Marc Mamin; Postgres General
Subject: RE: Counting the occurences of a substring within a very large
text

Marc Mamin wrote:

I'd like to count the number linebreaks within a string, but I get a
memory allocation error when using regexp_matches or

regexp_split_to_table.

Any idea for an alternative to this problem ?

select count(*)-1 from
( select regexp_split_to_table(full_message,'(\n)', 'g')
from mytable
where id =-2146999703
)foo;

ERROR: invalid memory alloc request size 1447215584

Does any of these two work:

SELECT length(regexp_replace(full_message, '[^\n]', '', 'g')) FROM
mytable WHERE id = -2146999703;

or

SELECT length(full_message) - length(replace(full_message, E'\n', ''))
FROM mytable WHERE id = -2146999703;

no, they both yeld the same error.

a new string functions for this would be nice, as it could certainly be implemented in a more efficient way...

BTW: the text to check is a single 350 MB error message from a log file :)

Yours,
Laurenz Albe

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#6Marc Mamin
M.Mamin@intershop.de
In reply to: Chris Mair (#2)
Re: Counting the occurences of a substring within a very large text

-----Original Message-----
From: Chris Mair [mailto:chris@1006.org]
Sent: Mittwoch, 24. Juni 2015 13:26
To: Marc Mamin; Postgres General
Subject: Re: [GENERAL] Counting the occurences of a substring within a
very large text

Hello,

I'd like to count the number linebreaks within a string, but I get a
memory allocation error when using regexp_matches or

regexp_split_to_table.

Any idea for an alternative to this problem ?

select count(*)-1 from
( select regexp_split_to_table(full_message,'(\n)', 'g')
from mytable
where id =-2146999703
)foo;

ERROR: invalid memory alloc request size 1447215584

regards,

Marc Mamin

Hi,

what's the size of full_message from mytable where id =-2146999703?

345 MB, Postgres 9.3.6

Marc Mamin

Also: Postgres version? OS? any extensions installed?

Bye,
Chris.

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#7Marc Mamin
M.Mamin@intershop.de
In reply to: Marc Mamin (#5)
Re: Counting the occurences of a substring within a very large text

I'd like to count the number linebreaks within a string, but I get
a memory allocation error when using regexp_matches or

regexp_split_to_table.

Any idea for an alternative to this problem ?

select count(*)-1 from
( select regexp_split_to_table(full_message,'(\n)', 'g')
from mytable
where id =-2146999703
)foo;

ERROR: invalid memory alloc request size 1447215584

Does any of these two work:

SELECT length(regexp_replace(full_message, '[^\n]', '', 'g')) FROM
mytable WHERE id = -2146999703;

or

SELECT length(full_message) - length(replace(full_message, E'\n',

''))

FROM mytable WHERE id = -2146999703;

no, they both yeld the same error.

and this fails too, which is more annoying as it looks like a bug:
SELECT replace(full_message, E'\n', '') FROM stadium_rprod.aserrorfull_20150623 WHERE id = -2146999703;

note that the 345MB text only contains 635 lines. This might be the issue...

Marc Mamin

a new string functions for this would be nice, as it could certainly be
implemented in a more efficient way...

BTW: the text to check is a single 350 MB error message from a
log file :)

Yours,
Laurenz Albe

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#8Geoff Winkless
pgsqladmin@geoff.dj
In reply to: Marc Mamin (#7)
Re: Counting the occurences of a substring within a very large text

On 24 June 2015 at 14:51, Marc Mamin <M.Mamin@intershop.de> wrote:

note that the 345MB text only contains 635 lines. This might be the
issue...

There's similar issue discussed here:

/messages/by-id/6046.1353874252@sss.pgh.pa.us

Tom did seem to accept that the attempted alloc is larger than it needs to
be, but even if it were fixed you would still run into problems further
down the line managing values of that size.

Geoff

#9Marc Mamin
M.Mamin@intershop.de
In reply to: Geoff Winkless (#8)
Re: Counting the occurences of a substring within a very large text

On 24 June 2015 at 14:51, Marc Mamin <M.Mamin@intershop.de> wrote:
note that the 345MB text only contains 635 lines. This might be the issue...

There's similar issue discussed here:

/messages/by-id/6046.1353874252@sss.pgh.pa.us

Tom did seem to accept that the attempted alloc is larger than it needs to be, but even if it were fixed you would still run into problems further down the line managing values of that size.

Thanks,

does it means that the issue would also occur in a (not yet existing) function that would just count the substring, without having to manipulate the text itself ?
or in other words, are there string functions that can handle such large text in stream, or is it always necessary to alloc enough memory for the whole text first ?

Marc

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#10Tom Lane
tgl@sss.pgh.pa.us
In reply to: Marc Mamin (#1)
Re: Counting the occurences of a substring within a very large text

Marc Mamin <M.Mamin@intershop.de> writes:

I'd like to count the number linebreaks within a string,
but I get a memory allocation error when using regexp_matches or regexp_split_to_table.

Anything involving a regexp is going to have a maximum input string length
of about 256MB, as a result of conversion to pg_wchar format.
regexp_split_to_table(), for instance, does this:

/* convert string to pg_wchar form for matching */
orig_len = VARSIZE_ANY_EXHDR(orig_str);
wide_str = (pg_wchar *) palloc(sizeof(pg_wchar) * (orig_len + 1));
wide_len = pg_mb2wchar_with_len(VARDATA_ANY(orig_str), wide_str, orig_len);

palloc() would complain for requests beyond 1GB, and sizeof(pg_wchar) is
4, so 256MB is the longest orig_str this can handle. (This coding is on
the hairy edge of being a security bug on 32-bit machines, too, but AFAICS
it is okay because there's a factor of 4 daylight between the max possible
input length of 1GB and overflowing uint32.)

In newer branches we could consider using MemoryContextAllocHuge to escape
the 1GB limit on wide_str, but it would take some research to be sure that
everything involved is using suitable datatypes for string indices and so
forth. I'd be a bit worried about the performance implications of
throwing such large strings around, anyway.

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#11Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Marc Mamin (#5)
Re: Counting the occurences of a substring within a very large text

Marc Mamin wrote:

I'd like to count the number linebreaks within a string, but I get a
memory allocation error when using regexp_matches or regexp_split_to_table.

Does any of these two work:

[...]

no, they both yeld the same error.

a new string functions for this would be nice, as it could certainly be implemented in a more efficient way...

This is a rather special case.
But it should be easy to write your own C function that does this efficiently.

Yours,
Laurenz Albe

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general