Unicode database + JDBC driver performance

Started by Jan Ploski · over 23 years ago · 12 messages · pgsql-general
#1Jan Ploski
jpljpl@gmx.de

Hello,

I have some questions regarding PostgreSQL handling of Unicode databases
and their performance. I am using version 7.2.1 and running two benchmarks
against a database set up with LATIN1 encoding and the same database
with UNICODE. The database consists of a single "test" table:

 Column |  Type   | Modifiers
--------+---------+-----------
 id     | integer | not null
 txt    | text    | not null
Primary key: test_pkey

The client is written in Java, it relies on the official JDBC driver,
and is being run on the same machine as the database.

Benchmark 1:

Insert 10,000 rows (in 10 transactions, 1000 rows per transaction)
into table "test". Each row contains 674 characters, most of which
are ASCII.

Benchmark 2:

select * from test, repeated 10 times in a loop

I am measuring the disk space taken by the database in each case
(LATIN1 vs UNICODE) and the time it takes to run the benchmarks.
I don't understand the results:

Disk space change (after inserts and vacuumdb -f):
LATIN1 UNICODE
764K 640K

I would have assumed that the Unicode database takes more space,
perhaps even twice as much. Apparently not (and that's nice).

Avg. Benchmark execution times (obtained with the 'time' command, repeatedly):
Benchmark 1:
LATIN1 UNICODE
11.5s 14.5s

Benchmark 2:
LATIN1 UNICODE
4.7s 8.6s

The Unicode database is slower both on INSERTs and especially on
SELECTs. I am wondering why. Since Java uses Unicode internally,
shouldn't it actually be more efficient to store/retrieve character
data in that format, with no recoding? Maybe it is an issue with the
JDBC driver? Or is handling Unicode inherently much slower on the
backend side?
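
Java strings are Unicode (UTF-16) in memory, so a conversion happens on the wire either way; a quick standalone illustration (not part of the benchmark above) of how the same text sizes up in each encoding:

```java
public class EncodingSizes {
    public static void main(String[] args) throws Exception {
        String s = "caf\u00E9"; // 4 characters, one non-ASCII
        // In memory Java uses UTF-16: 2 bytes per char for BMP text
        System.out.println(s.getBytes("UTF-16BE").length);   // 8
        // LATIN1 (ISO-8859-1): 1 byte per character
        System.out.println(s.getBytes("ISO-8859-1").length); // 4
        // UTF-8: 1 byte for ASCII, 2 bytes for the e-acute
        System.out.println(s.getBytes("UTF-8").length);      // 5
    }
}
```

So mostly-ASCII data is nearly the same size in UTF-8 as in LATIN1, which matches the disk-space numbers above; the question is the cost of the conversion itself.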

Take care -
JPL

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jan Ploski (#1)
Re: Unicode database + JDBC driver performance

Jan Ploski <jpljpl@gmx.de> writes:

The Unicode database is slower both on INSERTs and especially on
SELECTs.

What did you have PGCLIENTENCODING set to in each case? I am wondering
if you were incurring an extra conversion ...

regards, tom lane

#3Jan Ploski
jpljpl@gmx.de
In reply to: Tom Lane (#2)
Re: Unicode database + JDBC driver performance

On Sat, Dec 21, 2002 at 11:45:08AM -0500, Tom Lane wrote:

Jan Ploski <jpljpl@gmx.de> writes:

The Unicode database is slower both on INSERTs and especially on
SELECTs.

What did you have PGCLIENTENCODING set to in each case? I am wondering
if you were incurring an extra conversion ...

Tom,

When connecting to the Unicode database, I did not use any 'charSet'
parameter for the JDBC driver (which I assume translates to
PGCLIENTENCODING on the backend?). When I specified charSet=UNICODE,
I got an exception when executing the query. Without, it worked fine.

For the LATIN1 database, I used charSet=LATIN1 in the driver's connection
string. It worked as expected (the conversion of non-ASCII characters
was not correct, but that's OK for this test).

Is there some other way to find out what PGCLIENTENCODING is used
in each case?

Take care -
JPL

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jan Ploski (#3)
Re: Unicode database + JDBC driver performance

Jan Ploski <jpljpl@gmx.de> writes:

Is there some other way to find out what PGCLIENTENCODING is used
in each case?

I think "show client_encoding" and "show server_encoding" would tell
what you want (not sure how well JDBC cooperates with the SHOW command
though).

regards, tom lane

#5Jan Ploski
jpljpl@gmx.de
In reply to: Tom Lane (#4)
Re: Unicode database + JDBC driver performance

On Sat, Dec 21, 2002 at 02:57:51PM -0500, Tom Lane wrote:

Jan Ploski <jpljpl@gmx.de> writes:

Is there some other way to find out what PGCLIENTENCODING is used
in each case?

I think "show client_encoding" and "show server_encoding" would tell
what you want (not sure how well JDBC cooperates with the SHOW command
though).

Hello,

I could not get to this info via JDBC directly, but sniffing on port 5432
reveals the following output when the SHOW commands are run:

NOTICE: Current server encoding is 'UNICODE'.
NOTICE: Current client encoding is 'UNICODE'.

It seems correct to me, so the question about Unicode's slowness remains.
I guess that I will have a look at the driver's internals. Do you believe
that the backend should not be much slower [than LATIN1] when delivering
query results in the UNICODE encoding?

-JPL

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jan Ploski (#5)
Re: Unicode database + JDBC driver performance

Jan Ploski <jpljpl@gmx.de> writes:

It seems correct to me, so the question about Unicode's slowness remains.
I guess that I will have a look at the driver's internals. Do you believe
that the backend should not be much slower [than LATIN1] when delivering
query results in the UNICODE encoding?

If no translation is happening then I can't see a reason for any
difference at all, given what you were doing.

I can imagine speed differences for operations that actually interpret
the character data --- for example, sorting on a text column --- but
AFAICT you weren't doing anything like that.

It'd be worth checking first to see whether the extra time is really
being spent in the backend, client, or both.

regards, tom lane

#7Jan Ploski
jpljpl@gmx.de
In reply to: Tom Lane (#6)
Re: Unicode database + JDBC driver performance

On Sat, Dec 21, 2002 at 03:33:57PM -0500, Tom Lane wrote:

If no translation is happening then I can't see a reason for any
difference at all, given what you were doing.

I can imagine speed differences for operations that actually interpret
the character data --- for example, sorting on a text column --- but
AFAICT you weren't doing anything like that.

It'd be worth checking first to see whether the extra time is really
being spent in the backend, client, or both.

Hello,

Here are the results of the more detailed benchmarks.

Legend: the left column is utime, the right column is stime (both in
jiffies), according to "cat /proc/<pid>/stat" output for the 'postgres'
process created by establishing the database connection. For the client,
I used the Unix 'time' command (and assumed 1 jiffy == 1/100 s).
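
For reference, those counters can also be read programmatically along these lines (a sketch; the field offsets assume the Linux /proc/<pid>/stat layout, where utime and stime are the 14th and 15th fields):

```java
import java.io.BufferedReader;
import java.io.FileReader;

public class Jiffies {
    // Returns {utime, stime} in jiffies for the given pid ("self" works too).
    public static long[] read(String pid) throws Exception {
        BufferedReader r = new BufferedReader(
                new FileReader("/proc/" + pid + "/stat"));
        String line = r.readLine();
        r.close();
        // The comm field is parenthesized and may contain spaces,
        // so only split after the closing ')'.
        String[] f = line.substring(line.lastIndexOf(')') + 2).split(" ");
        // With pid and comm stripped, utime is index 11 and stime index 12.
        return new long[] { Long.parseLong(f[11]), Long.parseLong(f[12]) };
    }

    public static void main(String[] args) throws Exception {
        long[] t = read(args.length > 0 ? args[0] : "self");
        System.out.println("utime=" + t[0] + " stime=" + t[1]);
    }
}
```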

In short, we can see that the backend's performance is somewhat worse
for the Unicode database in the INSERT benchmark. The client's
(driver's) performance is much worse for Unicode in this same
benchmark.

In the SELECT benchmark, the backend performance does not vary
between UNICODE and LATIN1 (nice!). The difference in client's
performance is even bigger than in the INSERT benchmark, though.

It looks like the driver needs some profiling to find out the reason.

Thanks for your assistance.

-JPL

INSERT benchmark

backend
LATIN1        UNICODE
utime stime   utime stime
351 102 480 134
370 112 476 164
376 126 496 178
359 125 498 126
------- -------
364 116 487 150 33% 29% worse

client
LATIN1        UNICODE
utime stime   utime stime
234 45 418 56
224 56 441 48
241 43 430 49
222 41 439 34
------- -------
230 46 432 47 87% 2% worse

SELECT benchmark

backend
LATIN1        UNICODE
utime stime   utime stime
70 9 74 15
74 19 84 14
84 12 71 11
71 13 74 14
----- -----
75 13 76 14 about the same

client
LATIN1        UNICODE
utime stime   utime stime
217 22 529 14
217 18 544 25
217 26 534 14
219 21 542 17
------- -------
218 22 537 18 146% worse

#8Barry Lind
barry@xythos.com
In reply to: Jan Ploski (#1)
Re: Unicode database + JDBC driver performance

Jan,

You say you are using 7.2.1 -- is that for both the server and the JDBC
driver? There is a performance patch in the 7.3 driver that replaces the
built-in Java routines for converting to/from UTF-8 with a custom one.
The built-in Java routines are very slow on some JDKs (although on
JDK 1.4 they are pretty good). Can you try the 7.3 drivers?

thanks,
--Barry


#9Jan Ploski
jpljpl@gmx.de
In reply to: Barry Lind (#8)
Re: Unicode database + JDBC driver performance

On Mon, Dec 23, 2002 at 09:52:28AM -0800, Barry Lind wrote:

Jan,

You say you are using 7.2.1 -- is that for both the server and the JDBC
driver? There is a performance patch in the 7.3 driver that replaces the
built-in Java routines for converting to/from UTF-8 with a custom one.
The built-in Java routines are very slow on some JDKs (although on
JDK 1.4 they are pretty good). Can you try the 7.3 drivers?

thanks,
--Barry

Hello,

I was using the 7.2.2 JDBC driver and 7.2.1 backend in the original
benchmark. The Java runtime version was Sun's 1.3.1. I repeated the same
tests today with the 7.3 JDBC driver. The results, together with the older
driver's figures for comparison, are pasted below. The SELECT performance
is definitely improved, but the INSERT performance got even worse.

You are right about JDK 1.4: the difference is not there when I am using
Sun's 1.4. Could it be that the improvement you mentioned does not apply
there?

Take care -
JPL

SELECT benchmark

client (7.3.1 JDBC driver)
LATIN1        UNICODE
utime stime   utime stime
212 22 281 29
222 29 275 36
216 26 275 19
228 21 275 17
------- -------
220 25 277 25 26% worse

client (7.2.2 JDBC driver)
LATIN1        UNICODE
utime stime   utime stime
217 22 529 14
217 18 544 25
217 26 534 14
219 21 542 17
------- -------
218 22 537 18 146% worse

INSERT benchmark

client (7.3.1 JDBC driver)
LATIN1        UNICODE
utime stime   utime stime
159 40 345 49
189 36 375 26
181 37 397 44
180 36 376 36
------- -------
177 37 373 39 111% 5% worse

client (7.2.2 JDBC driver)
LATIN1        UNICODE
utime stime   utime stime
234 45 418 56
224 56 441 48
241 43 430 49
222 41 439 34
------- -------
230 46 432 47 87% 2% worse

#10Barry Lind
barry@xythos.com
In reply to: Jan Ploski (#9)
Re: Unicode database + JDBC driver performance

Jan,

You are correct: the 7.3 performance improvement was only implemented
for the decode method, not the encode method, so inserts will still have
the problem on certain JVMs.

You can look at the 7.3 version of org.postgresql.core.Encoding to see
what the code is using to do character set conversions. If you want to
submit a patch that handles the encoding case as well, that would be great.
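
For context, the decode-side conversion the 7.3 driver hand-rolls is roughly of this shape (a sketch of the idea only, not the actual org.postgresql.core.Encoding code; BMP-only, with no validation or surrogate handling):

```java
public class UTF8Decode {
    // Decode UTF-8 bytes without going through new String(byte[], String),
    // which on some 1.3-era JDKs was slow. Sketch: 1-3 byte sequences only.
    public static String decode(byte[] data, int off, int len) {
        char[] cbuf = new char[len]; // UTF-8 never yields more chars than bytes
        int c = 0, i = off, end = off + len;
        while (i < end) {
            int b = data[i] & 0xFF;
            if (b < 0x80) {                 // 1-byte (ASCII)
                cbuf[c++] = (char) b;
                i += 1;
            } else if (b < 0xE0) {          // 2-byte sequence
                cbuf[c++] = (char) (((b & 0x1F) << 6) | (data[i + 1] & 0x3F));
                i += 2;
            } else {                        // 3-byte sequence
                cbuf[c++] = (char) (((b & 0x0F) << 12)
                        | ((data[i + 1] & 0x3F) << 6)
                        | (data[i + 2] & 0x3F));
                i += 3;
            }
        }
        return new String(cbuf, 0, c);
    }
}
```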

thanks,
--Barry


#11Jan Ploski
jpljpl@gmx.de
In reply to: Barry Lind (#10)
Re: Unicode database + JDBC driver performance

On Mon, Dec 30, 2002 at 09:10:24AM -0800, Barry Lind wrote:

Jan,

You are correct: the 7.3 performance improvement was only implemented
for the decode method, not the encode method, so inserts will still have
the problem on certain JVMs.

You can look at the 7.3 version of org.postgresql.core.Encoding to see
what the code is using to do character set conversions. If you want to
submit a patch that handles the encoding case as well, that would be great.

Hello,

I wrote an optimized UTF8Encoder class, which I will send to you
in private. To finish this thread and for the archive, here are my
benchmark results with the improved version, compared to those with
the original 7.3.1 driver (again, JDK 1.3.1 + Linux).

Take care -
JPL

INSERT benchmark

client (7.3.1 JDBC driver, old)
LATIN1        UNICODE
utime stime   utime stime
159 40 345 49
189 36 375 26
181 37 397 44
180 36 376 36
------- -------
177 37 373 39 111% 5% worse

client (7.3.1 JDBC driver, new)

LATIN1        UNICODE
utime stime   utime stime
209 27 162 30
210 37 194 27
178 33 163 28
167 29 186 33
------- -------
191 32 176 30 7% 6% better :)

#12Jan Ploski
jpljpl@gmx.de
In reply to: Jan Ploski (#11)
Re: Unicode database + JDBC driver performance

Hello,

Here is my UTF8Encoder class mentioned on pgsql-general.

It should be put into org/postgresql/core, and you will also
need to patch Encoding.java so that it uses this class:

    if (encoding.equals("UTF-8")) {
        return UTF8Encoder.encode2(s);
    } else {
        return s.getBytes(encoding);
    }

There are two public utility methods in UTF8Encoder, encode1 and encode2.
They use two different approaches to determining how big the output
buffer should be. Performance-wise they seem very similar (encode2
being a bit slower), but I favor encode2 because it does less memory
allocation and copying.
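
One of the two sizing approaches is presumably an exact-count pass (my guess at the idea, since the attached encode1/encode2 code is not reproduced in the archive): count the output bytes in one loop over the string before allocating, so the final buffer is exactly right:

```java
public class UTF8Sizing {
    // One pass over the string to compute the exact UTF-8 output size
    // before allocating the buffer (BMP-only; each char maps to 1-3 bytes).
    public static int utf8Length(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            n += (c < 0x80) ? 1 : (c < 0x800) ? 2 : 3;
        }
        return n;
    }

    public static void main(String[] args) throws Exception {
        String s = "a\u00E4\u20AC"; // 1 + 2 + 3 bytes in UTF-8
        System.out.println(utf8Length(s));              // 6
        System.out.println(s.getBytes("UTF-8").length); // 6
    }
}
```

The alternative is to skip the counting pass, allocate the 3-bytes-per-char worst case, and trim afterwards, which trades memory for one less scan of the input.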

Note that I did not use any shared buffer in order to avoid
synchronization of multiple threads (as I understand, the class
Encoding must ensure thread safety itself). This may be an unnecessary
concern after all... I don't know.

UTF8Encoder can be used as is, or made into a private static inner
class of Encoding.java, whatever you prefer.

UTF8Encoder.main contains some tests to assert that it stays compatible
with Java's built-in encoder. It may be nicer to move them into
a JUnit test case, you decide.
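
The attached class itself is not preserved in the archive. As a rough sketch of an encoder along these lines (the method name encode2 comes from the thread, but the implementation here is an assumption, and surrogate pairs are ignored for brevity):

```java
public class UTF8Encoder {
    // encode2-style sketch: allocate the worst case (3 bytes per UTF-16
    // code unit) up front, then trim with a single final copy. BMP-only.
    public static byte[] encode2(String s) {
        if (s == null) return null;
        int len = s.length();
        byte[] buf = new byte[len * 3];
        int p = 0;
        for (int i = 0; i < len; i++) {
            char c = s.charAt(i);
            if (c < 0x80) {                          // 1-byte (ASCII)
                buf[p++] = (byte) c;
            } else if (c < 0x800) {                  // 2-byte sequence
                buf[p++] = (byte) (0xC0 | (c >> 6));
                buf[p++] = (byte) (0x80 | (c & 0x3F));
            } else {                                 // 3-byte sequence
                buf[p++] = (byte) (0xE0 | (c >> 12));
                buf[p++] = (byte) (0x80 | ((c >> 6) & 0x3F));
                buf[p++] = (byte) (0x80 | (c & 0x3F));
            }
        }
        byte[] out = new byte[p];
        System.arraycopy(buf, 0, out, 0, p);
        return out;
    }
}
```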

Take care -
JPL

Attachments:

UTF8Encoder.tar.gz (application/octet-stream)