Re: How are null's stored?

Started by Ryanover 22 years ago8 messages
#1Ryan
pgsql-performance@seahat.com

Jim,

I have a 40M row table I need to import data into, then use to create
a bunch of more normalized tables. Right now all fields are varchar,
but I'm going to change this so that fields that are less than a
certain size are just char. Question is, how much impact is there from
char being nullable vs. not nullable? src/include/access/htup.h
indicates that nulls are stored in a bitmap, so I'd suspect that I
should see a decent space savings from not having to include length
information all the time... (most of these small fields are always the
same size no matter what...)

This is moot. PostgreSQL stores CHAR(x), VARCHAR, and TEXT in the same
internal format, which includes length information in the page header.
So you save no storage space by converting to CHAR(x) ... you might
even make your tables *larger* because of the space padding.

So if the internal format is identical, why does the INFERNAL database
ignore indexes when you have a text compared to a varchar?

Ryan

#2Andrew Sullivan
andrew@libertyrms.info
In reply to: Ryan (#1)

On Mon, May 12, 2003 at 01:58:03PM -0500, Ryan wrote:

So if the internal format is identical, why does the INFERNAL database
ignore indexes when you have a text compared to a varchar?

Because the rules for handling the two data types are not the same.
Since spaces are significant on char(n) according to the spec, you
have strange rules in their handling.

Short answer: use text. Varchar(n) if you must, to limit length.
But char(n) is almost always evil.

A

-- 
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110
#3Josh Berkus
josh@agliodbs.com
In reply to: Ryan (#1)

Ryan,

So if the internal format is identical, why does the INFERNAL database
ignore indexes when you have a text compared to a varchar?

I don't seem to have this problem; I use TEXT or VARCHAR willy-nilly,
including in LIKE 'string%' and UPPER(field) queries, and the indexes work
fine.

I suspect that either you're talking about TEXT to CHAR(x) comparisons, which
are a different ball o' wax, or your query problem is something else.

--
-Josh Berkus
Aglio Database Solutions
San Francisco

#4Stephan Szabo
sszabo@megazone23.bigpanda.com
In reply to: Josh Berkus (#3)

On Mon, 12 May 2003, Josh Berkus wrote:

So if the internal format is identical, why does the INFERNAL database
ignore indexes when you have a text compared to a varchar?

I don't seem to have this problem; I use TEXT or VARCHAR willy-nilly,
including in LIKE 'string%' and UPPER(field) queries, and the indexes work
fine.

I can get the case he's complaining about with some cases I believe.

With an indexed varchar field, I can get 7.3.1 to give me:
sszabo=# set enable_seqscan=off;
SET
sszabo=# explain select * from aq2 where a=('f' || 'g');
QUERY PLAN
---------------------------------------------------------------------
Seq Scan on aq2 (cost=100000000.00..100000022.50 rows=1 width=168)
Filter: ((a)::text = 'fg'::text)

but

sszabo=# explain select * from aq2 where a=('f' || 'g')::varchar;
QUERY PLAN
----------------------------------------------------------------------
Index Scan using aq2_pkey on aq2 (cost=0.00..4.82 rows=1 width=168)
Index Cond: (a = 'fg'::character varying)

or

sszabo=# explain select * from aq2 where a=('f' || 'g'::varchar);
QUERY PLAN
----------------------------------------------------------------------
Index Scan using aq2_pkey on aq2 (cost=0.00..4.82 rows=1 width=168)
Index Cond: (a = 'fg'::character varying)

All in all, I'm not sure what the semantic differences between a varchar
with no length specified and a text are in PostgreSQL actually and if the
whole thing could be simplified in some way that doesn't break backwards
compatibility.

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Stephan Szabo (#4)

Stephan Szabo <sszabo@megazone23.bigpanda.com> writes:

All in all, I'm not sure what the semantic differences between a varchar
with no length specified and a text are in PostgreSQL actually and if the
whole thing could be simplified in some way that doesn't break backwards
compatibility.

Yeah, I've been wondering about that too. A large part of the problem
is that varchar has its own set of operators, which the planner has no
right to assume behave exactly like the text ones ... but they do. It
might work to rip out the redundant varchar operators and allow indexes
on varchar to become truly textual indexes (ie, they'd be text_ops not
varchar_ops opclass). There might be a few tweaks needed to get the
planner to play nice with indexes that require implicit coercions, but
I think it could be made to work.

Another idea that has been rattling around is to stop treating bpchar as
binary-equivalent to text, and in fact to make bpchar-to-text promotion
go through rtrim() to eliminate padding spaces.

I think this stuff got put on hold because we haven't been able to come
up with a good solution for the comparable problems in the numeric
datatype hierarchy. But bpchar/varchar/text is a lot simpler problem,
and maybe could be solved with the tools we have in place already.

regards, tom lane

#6Stephan Szabo
sszabo@megazone23.bigpanda.com
In reply to: Tom Lane (#5)

On Mon, 12 May 2003, Tom Lane wrote:

Stephan Szabo <sszabo@megazone23.bigpanda.com> writes:

All in all, I'm not sure what the semantic differences between a varchar
with no length specified and a text are in PostgreSQL actually and if the
whole thing could be simplified in some way that doesn't break backwards
compatibility.

Yeah, I've been wondering about that too. A large part of the problem
is that varchar has its own set of operators, which the planner has no
right to assume behave exactly like the text ones ... but they do. It
might work to rip out the redundant varchar operators and allow indexes
on varchar to become truly textual indexes (ie, they'd be text_ops not
varchar_ops opclass). There might be a few tweaks needed to get the
planner to play nice with indexes that require implicit coercions, but
I think it could be made to work.

This seems to possibly work on 7.4. I took my system and removed the
varchar comparison operators and directly made a text_ops index on a
varchar(30).
That gave me indexscans for
col = 'a'
col = 'a'::varchar
col = 'a'::text
col = 'a' || 'b'

but I don't know if it has other bad effects yet.

Another idea that has been rattling around is to stop treating bpchar as
binary-equivalent to text, and in fact to make bpchar-to-text promotion
go through rtrim() to eliminate padding spaces.

I guess this depends on how we read the comparisons/conversions from PAD
SPACE to NO PAD are supposed to work, but I think this would be good and
make things easier for alot of people since most people don't expect it,
especially when using functions like upper and lower that return text.

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Stephan Szabo (#6)

Stephan Szabo <sszabo@megazone23.bigpanda.com> writes:

On Mon, 12 May 2003, Tom Lane wrote:

It might work to rip out the redundant varchar operators and allow indexes
on varchar to become truly textual indexes (ie, they'd be text_ops not
varchar_ops opclass).

This seems to possibly work on 7.4. I took my system and removed the
varchar comparison operators and directly made a text_ops index on a
varchar(30).

Yeah, I fooled with it a little bit last night too. It seems that we'd
need to still have a varchar_ops entry in pg_opclass (else you get
complaints about unable to select a default opclass, not to mention that
old pg_dump files specifying varchar_ops would fail to load). But this
entry could point to the textual comparison operators. AFAICT the
planner doesn't have any problem dealing with the implicit coercions
that it's faced with in such cases.

Another idea that has been rattling around is to stop treating bpchar as
binary-equivalent to text, and in fact to make bpchar-to-text promotion
go through rtrim() to eliminate padding spaces.

I guess this depends on how we read the comparisons/conversions from PAD
SPACE to NO PAD are supposed to work, but I think this would be good and
make things easier for alot of people since most people don't expect it,
especially when using functions like upper and lower that return text.

I tried that too, and it seemed to work as expected. Whether it's
arguably more spec-compliant than our current behavior I dunno; haven't
looked at that part of the spec closely...

regards, tom lane

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Stephan Szabo (#6)
Simplifying varchar and bpchar behavior

We have had a couple of threads recently on improving
bpchar/varchar/text behavior by making bpchar-to-text promotion
go through rtrim() (instead of being a straight binary-compatible
conversion) and getting rid of redundant operators:

http://archives.postgresql.org/pgsql-hackers/2002-11/msg00703.php
http://archives.postgresql.org/pgsql-performance/2003-05/msg00151.php

I'm going to go ahead and make these changes for 7.4, as they will
clearly improve the intuitiveness of the behavior. I don't think they
move us any closer to spec compliance --- the spec appears to require
a notion of a collation sequence that can be specified independently
of the character datatype, which is something we don't have and aren't
likely to have very soon. (It's not clear to me that it's actually
*useful* to specify NO PAD collation with fixed-width character data,
or PAD SPACE with varchar data, but the spec lets you do it.) In the
meantime though these changes seem to be a win, and they will not leave
us any worse off when we do get around to implementing collations.

We speculated about a couple of alternative solutions in the first of
the above-mentioned threads, but they didn't look nearly as practical
to implement as this way.

Last call for objections ...

regards, tom lane