X-From-Line: mailagent Sun Nov 14 04:58:29 EST 2004
Return-Path: <pgsql-hackers-owner+M61061@postgresql.org>
Received: from po11.mit.edu [18.7.21.73]
	by stark.xeocode.com with POP3 (fetchmail-5.9.7)
	for stark@localhost (single-drop); Sun, 14 Nov 2004 04:58:29 -0500 (EST)
Received: from po11.mit.edu (po11.mit.edu [18.7.21.73])
	by po11.mit.edu (Cyrus v2.1.5) with LMTP;
	Sun, 14 Nov 2004 04:26:49 -0500
X-Sieve: CMU Sieve 2.2
Received: from pacific-carrier-annex.mit.edu by po11.mit.edu (8.12.4/4.7) id
	iAE9QgWp018623; Sun, 14 Nov 2004 04:26:42 -0500 (EST)
Received: from hosting.commandprompt.com (128.commandprompt.com
	[207.173.200.128])
	by pacific-carrier-annex.mit.edu (8.12.4/8.9.2) with ESMTP id
	iAE9PpER013498
	for <gsstark@mit.edu>; Sun, 14 Nov 2004 04:25:52 -0500 (EST)
Received: from postgresql.org (svr1.postgresql.org [200.46.204.71])
	by hosting.commandprompt.com (8.11.6/8.11.6) with ESMTP id iAE9NVE20660;
	Sun, 14 Nov 2004 01:24:44 -0800
X-Original-To: pgsql-hackers-postgresql.org@localhost.postgresql.org
Received: from localhost (unknown [200.46.204.144])
	by svr1.postgresql.org (Postfix) with ESMTP id BAD2F3A42AA
	for <pgsql-hackers-postgresql.org@localhost.postgresql.org>;
	Sun, 14 Nov 2004 09:22:01 +0000 (GMT)
Received: from svr1.postgresql.org ([200.46.204.71])
	by localhost (av.hub.org [200.46.204.144]) (amavisd-new, port 10024)
	with ESMTP id 21477-02
	for <pgsql-hackers-postgresql.org@localhost.postgresql.org>;
	Sun, 14 Nov 2004 09:21:46 +0000 (GMT)
Received: from sue.samurai.com (sue.samurai.com [205.207.28.74])
	by svr1.postgresql.org (Postfix) with ESMTP id 1417B3A4C57
	for <pgsql-hackers@postgresql.org>;
	Sun, 14 Nov 2004 09:21:47 +0000 (GMT)
Received: from localhost (localhost [127.0.0.1])
	by sue.samurai.com (Postfix) with ESMTP id 8B36C197D7;
	Sun, 14 Nov 2004 04:21:47 -0500 (EST)
Received: from sue.samurai.com ([127.0.0.1])
	by localhost (sue.samurai.com [127.0.0.1]) (amavisd-new, port 10024)
	with LMTP id 54314-02-7; Sun, 14 Nov 2004 04:21:46 -0500 (EST)
Received: from [220.101.5.64] (r220-101-5-64.cpe.unwired.net.au
 [220.101.5.64])
	by sue.samurai.com (Postfix) with ESMTP id DDC11197B2;
	Sun, 14 Nov 2004 04:21:44 -0500 (EST)
X-Gnus-Mail-Source: directory:~/incoming
Message-ID: <419723A7.5040505@samurai.com>
Date: Sun, 14 Nov 2004 20:21:43 +1100
From: Neil Conway <neilc@samurai.com>
User-Agent: Mozilla Thunderbird 0.9 (Macintosh/20041103)
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: PostgreSQL-development <pgsql-hackers@postgresql.org>
Cc: Oleg Bartunov <oleg@sai.msu.su>, Teodor Sigaev <teodor@stack.net>
Subject: [HACKERS] GiST: PickSplit and multi-attr indexes
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-Virus-Scanned: by amavisd-new at mailbox.samurai.com
X-Virus-Scanned: by amavisd-new at hub.org
X-Spam-Status: No, hits=0.0 tagged_above=0.0 required=5.0 tests=
X-Spam-Level: 
X-Mailing-List: pgsql-hackers
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
X-Scanned-By: MIMEDefang 2.42
X-Spam-Score: -4.9
X-Spam-Flag: NO
X-Mailagent-Processed: Sun, 14 Nov 2004 04:59:01 -0500
X-Mailagent-Processed: Job:157724 File:fm9653
X-Bogosity: No, tests=bogofilter, spamicity=0.034006, version=0.17.5
X-Mailagent-Delivered-To: pgsql-hackers
X-Filter: mailagent [version 3.0 PL73] for gsstark@mit.edu
Lines: 33
Xref: stark.xeocode.com list.pgsql-hackers:31466

Oleg & Teodor,

If I understand the code correctly, GiST will only pass the first 
attribute of each index tuple to the user-defined PickSplit method when 
it wants to split a node. (see circa line 1269 of gist.c)

Is this a wise design decision? Granted, in many situations the first 
attribute in the index is sufficient to make a reasonable decision about 
how to divide the node into two halves, but I don't think that is 
universally true. For example, consider a two column index whose first 
attribute has a small number of distinct values. It could well be that 
all the first attribute values in a node-to-be-split would be the same. 
Only passing the first attribute to PickSplit would result in an 
essentially random distribution of tuples among the split nodes, rather 
than allowing the GiST extension to make use of the second attribution 
to partition the nodes. That's an extreme example, but it is easy to 
construct more realistic scenarios (basically, any situation in which 
the cardinality of the first index attribute is low -- a reasonably 
common occurrence with a multi-column index, I believe).

I'm not sure whether this would be a problem in practice. Speculation: 
repeated invocations of PickSplit are one of the main factors in 
deriving the ultimate shape of the GiST tree. Poor distribution of keys 
resulting from PickSplit would eventually result in unnecessarily loose 
bounding predicates in internal nodes, which would hurt performance.

Comments welcome,

Neil

---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings


