niceideas.ch: Technological Thoughts by Jerome Kehrli

Entries tagged [character-encoding]

Java rocks !

by Jerome Kehrli

Posted on Wednesday Nov 03, 2010 at 08:40AM in Java

I've been facing an interesting problem with string manipulation in Java lately at work. The requirement was the following :

We have a field on some screen where the user can type in a comment. The comment can have any length the user wants, absolutely any. Should he want to type in a comment of a million characters, he should be able to do so.

Now the right way to store this comment in a database is using a CLOB, a BLOB or a LONGVARCHAR or whatever feature the database natively provides to do so. Unfortunately that's not the way it was designed. Due to legacy integration needs, all these advance DB types are prohibited within our application. So the way we have to store the comment consists of using several rows with a single comment field of a maximum length of 500 characters. That means the long comment has to be split in several sub-strings of 500 characters and each of them is stored in a separate row in the DB table. The table has a counter as part of the primary key which is incremented for each new row belonging to the same comment. This way we can easily spot every row part of the same comment.

Now another problem we have is that under DB2 a field defined as VARCHAR(500) can contain 500 bytes max even though the strings are encoded in UTF-8 in the database. That means we might not be able to store 500 characters if the string contains one or more 2 bytes UTF-8 characters. Working in a french environment, this happens a lot.
So we had to write a little algorithm taking care of the splitting of the string in 500 bytes sub-strings.

The very first version of our algorithm was quite stupid and ended up in splitting the string in a quite naive way: we converted the string to a byte array following an UTF-8 encoding and split the byte array instead of the string. Then each of the 500 bytes arrays was converted back to a string before being inserted in the database.
Happily, we figured out quite soon that this doesn't work as it ends up quite often splitting the string right in the middle of a 2 bytes character. The byte arrays being then converted back to strings, the split 2 bytes character was corrupted and could not be corrected any more.

Before writing as smarter version of the algorithm which would manually test the byte length of the character right at the position of the split, we took a leap backward and wondered : "Can it be that Java doesn't offer natively a simple way to do just that ?"

And the answer is yes of course.

Comments [6]

Tags: byte-array character-decoding character-encoding java string

Hot tags

Below are the most often used tags in the blog. Hover over a tag to see a count of entries, click a tag to see the most recent posts with the tag.