replace_text optimization (StringInfo to varlena)

Started by Daniel Verite almost 7 years ago, 2 messages

daniel@manitou-mail.org

almost 7 years ago

1 attachment(s)

Hi,

replace_text() in varlena.c builds the result in a StringInfo buffer,
and finishes by copying it into a freshly allocated varlena structure
with cstring_to_text_with_len(), in the same memory context.

It looks like that copy step could be avoided by prepending the
varlena header to the StringInfo to begin with, and returning the
buffer as a text*, as in the attached patch.

On large strings, the time saved can be significant. For instance
I'm seeing a ~20% decrease in total execution time on a test with
lengths in the 2-3 MB range, like this:

select sum(length(
replace(repeat('abcdefghijklmnopqrstuvwxyz', i*10), 'abc', 'ABC')
))
from generate_series(10000,12000) as i;

Also, at a glance, there are a few other functions with similar
StringInfo-to-varlena copies that seem avoidable:
concat_internal(), text_format(), replace_text_regexp().

Are there reasons not to do this? Otherwise, should it be considered
in a more principled way, such as adding functions like
void InitStringInfoForVarlena() and text *StringInfoAsVarlena()
to the StringInfo API?
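
To make the idea concrete, here is a rough standalone sketch of what
such a pair of functions might do. It is a simplified model, not
PostgreSQL code: Buf stands in for StringInfoData, HDRSZ for VARHDRSZ,
and malloc/realloc for palloc/repalloc. All names are hypothetical,
and the header holds a plain byte count rather than the encoding that
SET_VARSIZE() actually produces.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for PostgreSQL's 4-byte varlena length header (VARHDRSZ). */
#define HDRSZ ((int) sizeof(uint32_t))

/* Simplified model of StringInfoData: a growable, NUL-terminated buffer. */
typedef struct
{
	char	   *data;
	int			len;		/* bytes used, excluding the trailing NUL */
	int			maxlen;		/* bytes allocated */
} Buf;

/* Grow the buffer until 'needed' more bytes plus the trailing NUL fit. */
static void
buf_enlarge(Buf *b, int needed)
{
	int			newmax = b->maxlen;

	while (b->len + needed + 1 > newmax)
		newmax *= 2;
	if (newmax != b->maxlen)
	{
		b->data = realloc(b->data, newmax);
		assert(b->data != NULL);
		b->maxlen = newmax;
	}
}

/* Model of the proposed InitStringInfoForVarlena(): reserve header room. */
static void
buf_init_for_varlena(Buf *b)
{
	b->maxlen = 64;
	b->data = malloc(b->maxlen);
	assert(b->data != NULL);
	memset(b->data, 0, HDRSZ);	/* header is filled in at the end */
	b->len = HDRSZ;
	b->data[b->len] = '\0';
}

/* Append n bytes after whatever is already in the buffer. */
static void
buf_append(Buf *b, const char *src, int n)
{
	buf_enlarge(b, n);
	memcpy(b->data + b->len, src, n);
	b->len += n;
	b->data[b->len] = '\0';
}

/*
 * Model of the proposed StringInfoAsVarlena(): store the total size
 * (header included) in the reserved bytes and hand over the buffer
 * itself, so no second allocation or copy is needed.
 */
static char *
buf_as_varlena(Buf *b)
{
	uint32_t	total = (uint32_t) b->len;

	memcpy(b->data, &total, HDRSZ);
	return b->data;
}

/* Read back the total size stored in the header. */
static uint32_t
varlena_size(const char *v)
{
	uint32_t	total;

	memcpy(&total, v, HDRSZ);
	return total;
}

/* Pointer to the payload that follows the header. */
static const char *
varlena_payload(const char *v)
{
	return v + HDRSZ;
}
```

A caller would run buf_init_for_varlena(), append the pieces of the
result as usual, and return buf_as_varlena(&b) directly. The trade-off
is that the returned chunk keeps whatever slack the buffer had when
the last append finished.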

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite

Attachments:

replace-text-no-copy.patch (text/plain)

diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 693ccc5..3df54ed 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -4136,6 +4136,10 @@ replace_text(PG_FUNCTION_ARGS)
 
 	initStringInfo(&str);
 
+	/* allocate a varlena header at the start of the stringinfo */
+	enlargeStringInfo(&str, VARHDRSZ);
+	str.len += VARHDRSZ;
+
 	do
 	{
 		CHECK_FOR_INTERRUPTS();
@@ -4160,8 +4164,8 @@ replace_text(PG_FUNCTION_ARGS)
 
 	text_position_cleanup(&state);
 
-	ret_text = cstring_to_text_with_len(str.data, str.len);
-	pfree(str.data);
+	ret_text = (text*) str.data;
+	SET_VARSIZE(ret_text, str.len); /* VARHDRSZ is already included in str.len */
 
 	PG_RETURN_TEXT_P(ret_text);
 }

Kyotaro HORIGUCHI

horiguchi.kyotaro@lab.ntt.co.jp

almost 7 years ago

In reply to: Daniel Verite (#1)

Re: replace_text optimization (StringInfo to varlena)

At Wed, 13 Feb 2019 16:38:50 +0100, "Daniel Verite" <daniel@manitou-mail.org> wrote in <bc319ec6-60d0-4878-a800-bcc12a190c02@manitou-mail.org>

> Hi,
>
> replace_text() in varlena.c builds the result in a StringInfo buffer,
> and finishes by copying it into a freshly allocated varlena structure
> with cstring_to_text_with_len(), in the same memory context.
>
> It looks like that copy step could be avoided by prepending the
> varlena header to the StringInfo to begin with, and returning the
> buffer as a text*, as in the attached patch.
>
> On large strings, the time saved can be significant. For instance
> I'm seeing a ~20% decrease in total execution time on a test with
> lengths in the 2-3 MB range, like this:
>
> select sum(length(
> replace(repeat('abcdefghijklmnopqrstuvwxyz', i*10), 'abc', 'ABC')
> ))
> from generate_series(10000,12000) as i;
>
> Also, at a glance, there are a few other functions with similar
> StringInfo-to-varlena copies that seem avoidable:
> concat_internal(), text_format(), replace_text_regexp().
>
> Are there reasons not to do this? Otherwise, should it be considered
> in a more principled way, such as adding functions like
> void InitStringInfoForVarlena() and text *StringInfoAsVarlena()
> to the StringInfo API?

First, I agree that the wasted cycles should be eliminated.

Grepping for 'cstring_to_text_with_len\(.*[\.>]data,.*\)' shows
many instances of this pattern. Although StringInfo is a very
open structure, the number of instances is a good reason to
have new API functions.

That is, I vote for adding a set of API functions to StringInfo
for this use. But the latter function seems difficult to name.
The naming convention for this object is basically
<verb>StringInfo. getVarlenaStringInfo/getTextStringInfo
would fit the convention, but they seem a bit strange to me.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center