GSoC proposal

Started by Tan Tran almost 12 years ago, 3 messages

tankimtran@gmail.com

almost 12 years ago

1 attachment(s)

Hi developers,

I'm applying for GSoC 2014 with Postgresql and would appreciate your
comments on my proposal (attached). I'm looking for technical
corrections/comments and your opinions on the project's viability. In
particular, if the community has doubts about its usefulness, I would start
working on an extra proposal from https://wiki.postgresql.org/wiki/GSoC_2014,
perhaps on the RETURNING clause as a student named Karlik did last year.

Thanks,
Tan Tran

Florian Pflug

fgp@phlo.org

almost 12 years ago

In reply to: Tan Tran (#1)

Re: GSoC proposal

On Feb 28, 2014, at 05:29, Tan Tran <tankimtran@gmail.com> wrote:

I'm applying for GSoC 2014 with Postgresql and would appreciate your comments
on my proposal (attached).
<pg_gsoc2014_TanTran.pdf>

First, please include your proposal as plain, inline text next time.
That makes it easier to quote the relevant parts when replying, and
also allows your mail to be indexed correctly by the mailing list
archive.

Regarding your proposal, I think you need to explain what exactly it
is you want to achieve in more detail.

In particular, text and bytea are EXTERNAL by default, so that substring
operations can seek straight to the exact slice (which is O(1)) instead
of de-toasting the whole datum (which is O(file size)). Specifically,
varlena.c’s text_substring(...) and bytea_substring(...) call
DatumGetTextPSlice(...), which retrieves only the slice(s) at an
easily-computed offset.

...

1. First, I will optimize array element retrieval and UTF-8 substring
retrieval. Both are straightforward, as they involve calculating slice
numbers and using similar code to above.

I'm confused by that - text_substring *already* attempts to only fetch
the relevant slice in the case of UTF-8. It can't do so precisely - it
needs to use a conservative estimate - but I fail to see how that can
be avoided. Since UTF-8 maps a character to anything from 1 to 6 bytes,
you can't compute the byte offset of a given character index precisely.
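
To illustrate the conservative estimate described above, here is a minimal
sketch in Python. It is not the code in varlena.c; the function names and
the max_bytes_per_char parameter are hypothetical, and the only assumption
is that no character occupies more than max_bytes_per_char bytes. Since the
starting byte of a character cannot be known without decoding, the fetch
begins at byte 0 and over-reads up to the worst-case end position.

```python
def conservative_byte_range(start_char, length_chars, max_bytes_per_char):
    """Return (byte_offset, byte_count) guaranteed to cover the requested
    characters. start_char is 1-based, as in SQL substring()."""
    # Last character index (1-based) that the substring can touch.
    end_char = start_char - 1 + length_chars
    # Worst case: every character up to end_char has maximum width.
    return 0, end_char * max_bytes_per_char


def substring_via_prefix(data, start_char, length_chars, max_bytes_per_char):
    """Extract a character substring from UTF-8 bytes, reading only a prefix."""
    offset, count = conservative_byte_range(
        start_char, length_chars, max_bytes_per_char
    )
    prefix = data[offset:offset + count]
    # The prefix may end in the middle of a character; drop the partial tail.
    text = prefix.decode("utf-8", errors="ignore")
    return text[start_char - 1:start_char - 1 + length_chars]
```

The over-read grows with the character index, which is why the estimate
helps for substrings near the start of a value and much less near the end.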

You could store a constant number of *characters* per slice, instead of
a constant number of *bytes*, but due to the rather large worst-case of
6 bytes per character, that would increase the storage and access overhead
6 fold for languages which can largely be represented with 1 byte per
character. That's not going to go down well...
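
One way to read the 6-fold figure is as a chunk count, sketched below in
Python. The 2000-byte chunk size is a round, hypothetical number rather than
the real TOAST chunk size, and the function names are made up for
illustration. If each slice must be able to hold its characters even at 6
bytes apiece, a slice holds only chunk_bytes // 6 characters, so
single-byte text fills roughly one sixth of every slice.

```python
import math


def chunks_by_bytes(total_bytes, chunk_bytes):
    """Chunks needed when each chunk holds a fixed number of bytes."""
    return math.ceil(total_bytes / chunk_bytes)


def chunks_by_chars(total_chars, chunk_bytes, max_bytes_per_char):
    """Chunks needed when each chunk holds a fixed number of characters,
    sized so that the worst-case character width still fits."""
    chars_per_chunk = chunk_bytes // max_bytes_per_char
    return math.ceil(total_chars / chars_per_chunk)


# One million single-byte characters, hypothetical 2000-byte chunks.
byte_chunks = chunks_by_bytes(1_000_000, 2000)
char_chunks = chunks_by_chars(1_000_000, 2000, 6)
ratio = char_chunks / byte_chunks
```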

I haven't looked at how we currently handle arrays, but the problems
there are similar. For arrays containing variable-length types, you can't
compute the byte offset from the index. It's even worse than for varchar,
because the range of possible element lengths is much larger - one array
element might be only a few bytes long, while another may be 1kB or more...
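
The difference between fixed-width and variable-length elements can be
sketched with a toy layout in Python. This is not PostgreSQL's on-disk array
format; the 4-byte length prefix and all function names are assumptions made
for illustration only. With fixed-width elements the offset is a
multiplication, while with length-prefixed elements every preceding element
has to be visited first.

```python
import struct


def pack_varlen(elements):
    """Toy layout: each element is a 4-byte little-endian length plus data."""
    return b"".join(struct.pack("<I", len(e)) + e for e in elements)


def nth_fixed(buf, index, width):
    """Fixed-width elements: the byte offset follows directly from the index."""
    return buf[index * width:(index + 1) * width]


def nth_varlen(buf, index):
    """Variable-length elements: walk every preceding length prefix."""
    pos = 0
    for _ in range(index):
        (n,) = struct.unpack_from("<I", buf, pos)
        pos += 4 + n
    (n,) = struct.unpack_from("<I", buf, pos)
    return buf[pos + 4:pos + 4 + n]
```

In the variable-length case, all bytes before the wanted element must be
read just to locate it, so knowing the index alone does not say which slice
to fetch.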

2. Second, I will implement a SPLITTER clause for the CREATE TYPE
statement. As 1 proposes, one would define a type, for example:
CREATE TYPE my_xml
LIKE xml
SPLITTER my_xml_splitter;

As far as I can tell, the idea is to allow a datatype to influence how
it's split into chunks for TOASTing so that functions can fetch only
the required slices more easily. To judge whether that is worthwhile or
not, you'd have to provide a concrete example of when such a facility
would be useful.

best regards,
Florian Pflug

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Albe Laurenz

laurenz.albe@wien.gv.at

almost 12 years ago

In reply to: Tan Tran (#1)

Re: GSoC proposal

I'm applying for GSoC 2014 with Postgresql and would appreciate your comments on my proposal
(attached). I'm looking for technical corrections/comments and your opinions on the project's
viability. In particular, if the community has doubts about its usefulness, I would start working on
an extra proposal from https://wiki.postgresql.org/wiki/GSoC_2014, perhaps on the RETURNING clause as
a student named Karlik did last year.

I am sure that Simon had his reasons when he proposed
/messages/by-id/CA+U5nMJGgJNt5VXqkR=crtDqXFmuyzwEF23-fD5NuSns+6N5dA@mail.gmail.com
but I cannot help asking some questions:

1) Why limit the feature to UTF8 strings?
Shouldn't the technique work for all multibyte server encodings?

2) There is probably something that makes this necessary, but why should the decision
on how TOAST is sliced be attached to the data type?
My (probably naive) idea would be to add a new TOAST strategy (e.g. SLICED)
to PLAIN, MAIN, EXTERNAL and EXTENDED.

The feature only makes sense for string data types, right?

Yours,
Laurenz Albe

