GSoC project : K-medoids clustering in Madlib

Started by viod, almost 13 years ago (8 messages)
#1viod
viod.len@gmail.com

Hello!

I'm an IT student, and I would like to apply for the 2013 GSoC.
I've been looking at this mailing list for a while now, and I saw a
suggestion for GSoC that particularly interested me: implementing the
K-medoids clustering in Madlib, as it is supposed to be more efficient than
the K-means algorithm.

I didn't know about these algorithms before, but I have read up on them,
and they look quite interesting to me, all the more so as I am currently
taking classes on this (though very simplified ones, unfortunately).

I've got a few questions:
Won't this be quite a short project? I can't estimate how long it
would take me to implement this algorithm in a way that would be usable by
PostgreSQL, but 3 months seems long for this task, doesn't it?

Someone on the IRC channel (can't remember who, sorry) told me it was used
in the KNN index. I guess this is used by pg_trgm, but are there other
modules using it currently?
And could you please give me some links explaining the internals of this
index? I've been through several articles presenting it, but none were very
satisfying.

Thanks a lot in advance!

#2Atri Sharma
atri.jiit@gmail.com
In reply to: viod (#1)
Re: GSoC project : K-medoids clustering in Madlib

I suggested a couple of algorithms to be implemented in MADlib (apart
from K-medoids). You could pick some (or all) of them, which would
require 3 months to complete.

As for more information on the index, you can refer to

http://wiki.postgresql.org/wiki/What's_new_in_PostgreSQL_9.1

along with the postgres wiki. The wiki is the standard for anything postgres.

pg_trgm uses KNN, but I believe it uses its own implementation of the
algorithm. The idea I proposed aims at writing an implementation in
MADlib so that any client program can use the algorithm(s) in
its code directly, using MADlib functions.

Regards,

Atri

On 3/26/13, viod <viod.len@gmail.com> wrote:

> Hello!
>
> I'm an IT student, and I would like to apply for the 2013 GSoC.
> I've been looking at this mailing list for a while now, and I saw a
> suggestion for GSoC that particularly interested me: implementing the
> K-medoids clustering in Madlib, as it is supposed to be more efficient than
> the K-means algorithm.
>
> I didn't know about these algorithms before, but I have documented myself,
> and it looks quite interesting to me, and even more as I currently have
> lessons (but very very simplified unfortunately).
>
> I've got a few questions:
> Won't this be a quite short project? I can't get an idea of how long it
> would take me to implement this algorithm in a way that would be usable by
> postgresql, but 3 months looks long for this task, doesn't it?
>
> Someone on the IRC channel (can't remember who, sorry) told me it was used
> in the KNN index. I guess this is used by pg_trgm, but are there other
> modules using it currently?
> And could you please give me some links explaining the internals of this
> index? I've been through several articles presenting of it, but none very
> satisfying.
>
> Thanks a lot in advance!

--
Regards,

Atri
*l'apprenant*

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Atri Sharma (#2)
Re: GSoC project : K-medoids clustering in Madlib

Atri Sharma <atri.jiit@gmail.com> writes:

> I suggested a couple of algorithms to be implemented in MADLib(apart
> from K Medoids). You could pick some(or all) of them, which would
> require 3 months to be completed.
>
> As for more information on index, you can refer
>
> http://wiki.postgresql.org/wiki/What's_new_in_PostgreSQL_9.1
>
> along with the postgres wiki. The wiki is the standard for anything postgres.
>
> pg_trgm used KNN, but I believe it uses its own implementation of the
> algorithm. The idea I proposed aims at writing an implementation in
> the MADlib so that any client program can use the algorithm(s) in
> their code directly, using MADlib functions.

I'm a bit confused as to why this is being proposed as a
Postgres-related project. I don't even know what MADlib is, but I'm
pretty darn sure that no part of Postgres uses it. KNNGist certainly
doesn't.

regards, tom lane

#4Daniel Farina
daniel@heroku.com
In reply to: Tom Lane (#3)
Re: GSoC project : K-medoids clustering in Madlib

On Tue, Mar 26, 2013 at 10:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Atri Sharma <atri.jiit@gmail.com> writes:
>
>> I suggested a couple of algorithms to be implemented in MADLib(apart
>> from K Medoids). You could pick some(or all) of them, which would
>> require 3 months to be completed.
>>
>> As for more information on index, you can refer
>>
>> http://wiki.postgresql.org/wiki/What's_new_in_PostgreSQL_9.1
>>
>> along with the postgres wiki. The wiki is the standard for anything postgres.
>>
>> pg_trgm used KNN, but I believe it uses its own implementation of the
>> algorithm. The idea I proposed aims at writing an implementation in
>> the MADlib so that any client program can use the algorithm(s) in
>> their code directly, using MADlib functions.
>
> I'm a bit confused as to why this is being proposed as a
> Postgres-related project. I don't even know what MADlib is, but I'm
> pretty darn sure that no part of Postgres uses it. KNNGist certainly
> doesn't.

It's a reasonably well established extension for Postgres for
statistical and machine learning methods. Rather neat, but as you
indicate, it's not part of Postgres proper.

http://madlib.net/

https://github.com/madlib/madlib/

--
fdr

#5Atri Sharma
atri.jiit@gmail.com
In reply to: Daniel Farina (#4)
Re: GSoC project : K-medoids clustering in Madlib

>> I'm a bit confused as to why this is being proposed as a
>> Postgres-related project. I don't even know what MADlib is, but I'm
>> pretty darn sure that no part of Postgres uses it. KNNGist certainly
>> doesn't.
>
> It's a reasonably well established extension for Postgres for
> statistical and machine learning methods. Rather neat, but as you
> indicate, it's not part of Postgres proper.
>
> http://madlib.net/
>
> https://github.com/madlib/madlib/

It is the extension that is normally referred to when we talk about
data analytics in Postgres. As you said, it is not part of Postgres
proper, but IMO, if we want to extend the data analytics
functionality of Postgres, we need to work on MADlib.

Regards,

Atri

#6Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Atri Sharma (#5)
Re: GSoC project : K-medoids clustering in Madlib

On 27.03.2013 08:51, Atri Sharma wrote:

>>> I'm a bit confused as to why this is being proposed as a
>>> Postgres-related project. I don't even know what MADlib is, but I'm
>>> pretty darn sure that no part of Postgres uses it. KNNGist certainly
>>> doesn't.
>>
>> It's a reasonably well established extension for Postgres for
>> statistical and machine learning methods. Rather neat, but as you
>> indicate, it's not part of Postgres proper.
>>
>> http://madlib.net/
>>
>> https://github.com/madlib/madlib/
>
> It is the extension that is normally referred to when we talk about
> data analytics in Postgres. As you said, it is not part of postgres
> proper,but IMO, if we want to extend the data analytics
> functionalities of postgres, we need to work on MADlib.

Perhaps we could do this under the PostgreSQL organization, but we'd
definitely need to get someone from the MADlib project to mentor it.

But it would be even better if MADlib applied to GSoC as an
independent organization. The deadline for organization applications is
March 29th, so if the MADlib people are interested in that, they need
to hurry and send the application right now.

- Heikki

#7Atri Sharma
atri.jiit@gmail.com
In reply to: Heikki Linnakangas (#6)
Re: GSoC project : K-medoids clustering in Madlib

> But it would be even better if MADLib would apply to GSoC as an independent
> organization. The deadline for organization applications is on March 29th,
> so if the MADLIb people are interested in that, they need to hurry and send
> the application right now.

Agreed. Is there any way we could add in-house support for basic data
analytics, maybe as a proper Postgres extension?

Regards,

Atri

#8Thom Brown
thom@linux.com
In reply to: Heikki Linnakangas (#6)
Re: GSoC project : K-medoids clustering in Madlib

On 27 March 2013 08:12, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

> On 27.03.2013 08:51, Atri Sharma wrote:
>
>>>> I'm a bit confused as to why this is being proposed as a
>>>> Postgres-related project. I don't even know what MADlib is, but I'm
>>>> pretty darn sure that no part of Postgres uses it. KNNGist certainly
>>>> doesn't.
>>>
>>> It's a reasonably well established extension for Postgres for
>>> statistical and machine learning methods. Rather neat, but as you
>>> indicate, it's not part of Postgres proper.
>>>
>>> http://madlib.net/
>>>
>>> https://github.com/madlib/madlib/
>>
>> It is the extension that is normally referred to when we talk about
>> data analytics in Postgres. As you said, it is not part of postgres
>> proper,but IMO, if we want to extend the data analytics
>> functionalities of postgres, we need to work on MADlib.
>
> Perhaps we could do this under the PostgreSQL organization, but we'd
> definitely need to get someone from the MADLib project to mentor it.
>
> But it would be even better if MADLib would apply to GSoC as an independent
> organization. The deadline for organization applications is on March 29th,
> so if the MADLIb people are interested in that, they need to hurry and send
> the application right now.

It would also help if they were able to get in contact so that I could
add them as a project we'd vouch for as part of our own application.

--
Thom
