Unicode comment on Postgres vs Sql Server

Started by Swaminathan Saikumar | about 18 years ago | 6 messages | general

swami@giveexam.com

about 18 years ago

I am familiar with MS SQL Server and have just started using Postgres.
For storing Unicode, SQL Server uses nvarchar/nchar, and uses
char/varchar for ASCII.
Postgres has this encoding setting at the database level.

I am using UTF8 Unicode for most of my data, but there is some data that I
know for sure will be ASCII. However, this is also stored as UTF8, using up
more space.

At first sight, it looks like the more granular design is better.
Any comments? If you agree, does it make sense to add this as a new datatype
to Postgres?

Thanks

Stephane Bortzmeyer

bortzmeyer@nic.fr

about 18 years ago

In reply to: Swaminathan Saikumar (#1)

Re: Unicode comment on Postgres vs Sql Server

On Sun, Mar 02, 2008 at 11:50:01AM -0800,
Swaminathan Saikumar <swami@giveexam.com> wrote
a message of 30 lines which said:

> Postgres has this encoding setting at the database level.

Which is simpler, IMHO. "One encoding to rule them all."

> I am using UTF8 Unicode for most of my data, but there is some data
> that I know for sure will be ASCII. However, this is also stored as
> UTF8, using up more space.

Excuse me, but this shows a serious ignorance of UTF-8. A character of
the ASCII range, in UTF-8, is stored in one byte, exactly the same
size as in ASCII (any ASCII file is a UTF-8 file; that's an important
property of UTF-8).
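
The point above can be checked quickly. The following is a minimal Python sketch, not anything from the thread itself; the sample string is an arbitrary assumption, and any text restricted to the ASCII range should behave the same way.

```python
# Encode the same ASCII-only text as ASCII and as UTF-8 and compare the bytes.
text = "Postgres vs SQL Server"  # arbitrary sample, ASCII characters only

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# Both encodings produce one byte per character, and the bytes are identical.
print(len(text), len(ascii_bytes), len(utf8_bytes))
print(ascii_bytes == utf8_bytes)
```

Because the two byte strings are equal, storing such data in a UTF8 database should cost no more space than storing it as ASCII.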

Leif B. Kristensen

leif@solumslekt.org

about 18 years ago

In reply to: Swaminathan Saikumar (#1)

Re: Unicode comment on Postgres vs Sql Server

On Sunday 2. March 2008, Swaminathan Saikumar wrote:

> I am using UTF8 Unicode for most of my data, but there is some data
> that I know for sure will be ASCII. However, this is also stored as
> UTF8, using up more space.

ASCII stored as UTF8 doesn't take up more space than plain ASCII; it's
exactly the same thing. It's one byte per character unless the
character's code point is above 127.
--
Leif Biberg Kristensen | Registered Linux User #338009
http://solumslekt.org/ | Cruising with Gentoo/KDE
My Jazz Jukebox: http://www.last.fm/user/leifbk/
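
To illustrate Leif's "one byte per character unless the character number is above 127", here is a small Python sketch. The helper name `utf8_width` and the sample characters are assumptions chosen for illustration only.

```python
def utf8_width(ch: str) -> int:
    """Return the number of bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


# Hypothetical samples: two ASCII characters (code points <= 127),
# then characters from progressively higher code point ranges.
samples = ["A", "\x7f", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    print(f"U+{ord(ch):04X} -> {utf8_width(ch)} byte(s)")
```

Only the characters above code point 127 need more than one byte, so the extra cost applies per character, not per column or per database.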

Tino Wildenhain

tino@wildenhain.de

about 18 years ago

In reply to: Swaminathan Saikumar (#1)

Re: Unicode comment on Postgres vs Sql Server

Swaminathan Saikumar wrote:

> I am familiar with MS SQL Server and have just started using Postgres.
> For storing Unicode, SQL Server uses nvarchar/nchar, and uses
> char/varchar for ASCII.
> Postgres has this encoding setting at the database level.
>
> I am using UTF8 Unicode for most of my data, but there is some data that
> I know for sure will be ASCII. However, this is also stored as UTF8,
> using up more space.

This is wrong - ASCII is a subset of UTF8 and therefore uses
exactly one byte for every ASCII char.

See http://en.wikipedia.org/wiki/UTF-8 for example.

> At first sight, it looks like the more granular design is
> better. Any comments? If you agree, does it make sense to add this as a
> new datatype to Postgres?

Which new datatype?

Regards
Tino

Swaminathan Saikumar

swami@giveexam.com

about 18 years ago

In reply to: Tino Wildenhain (#4)

Re: Unicode comment on Postgres vs Sql Server

I didn't have proper knowledge about the UTF8 format, thanks.
I originally meant nvarchar & nchar, which are basically varchar & char that
support Unicode regardless of the database encoding.

Tino Wildenhain

tino@wildenhain.de

about 18 years ago

In reply to: Swaminathan Saikumar (#5)

Re: Unicode comment on Postgres vs Sql Server

Swaminathan Saikumar wrote:

> I didn't have proper knowledge about the UTF8 format, thanks.
> I originally meant nvarchar & nchar, which are basically varchar & char
> that support Unicode regardless of the database encoding.

Well, we don't need that when we have UTF8. There could be edge cases
speed-wise when you use UTF-16 or UTF-32 internally, but I'm not sure
how well this would justify a new datatype.
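
For a rough feel of the storage side of that trade-off, the sketch below compares the encoded size of the same text in UTF-8, UTF-16 and UTF-32. It is only an illustration in Python: the function name `encoded_sizes` and the sample strings are assumptions, and it says nothing about how any database stores text internally.

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte length of text in three Unicode encodings (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


# ASCII-only text is smallest in UTF-8; CJK text is smaller in UTF-16.
print(encoded_sizes("hello"))
print(encoded_sizes("\u65e5\u672c\u8a9e"))
```

The wider encodings buy fixed or near-fixed character widths at the cost of space for ASCII-heavy data, which is the kind of edge case the reply alludes to.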

The current problem isn't so much with encoding database-wise; it's more
about collation database-cluster-wise, which is something not
easily solved when you want to do it according to the SQL spec.

You could work around that with a functional index.

Regards
Tino Wildenhain
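
The "functional index" workaround mentioned above means indexing the result of an expression instead of the raw column. The reply does not say which expression is intended, so the sketch below is an assumption throughout: it uses `lower(name)` purely for illustration, with a hypothetical `people` table, and it runs against SQLite through Python's `sqlite3` module as a stand-in so that it is self-contained. In PostgreSQL the corresponding statement would be `CREATE INDEX ... ON people (lower(name))`.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT)")  # hypothetical table
conn.executemany(
    "INSERT INTO people (name) VALUES (?)",
    [("Alice",), ("Bob",), ("carol",)],
)

# Index on an expression: lookups on lower(name) can use this index
# regardless of how the raw column itself would be compared.
conn.execute("CREATE INDEX people_lower_name ON people (lower(name))")

# The query must use the same expression as the index definition.
rows = conn.execute(
    "SELECT name FROM people WHERE lower(name) = lower(?)", ("BOB",)
).fetchall()
print(rows)
conn.close()
```

The design point is that the query has to repeat the indexed expression exactly; a query on the bare `name` column would not benefit from this index.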