niceideas.ch
Technological Thoughts by Jerome Kehrli

Java rocks !

by Jerome Kehrli


Posted on Wednesday Nov 03, 2010 at 08:40AM in Java


I've recently been facing an interesting string manipulation problem in Java at work. The requirement was the following:

We have a field on some screen where the user can type in a comment. The comment can have any length the user wants, absolutely any. Should he want to type in a comment of a million characters, he should be able to do so.

Now the right way to store this comment in a database is to use a CLOB, a BLOB, a LONGVARCHAR or whatever feature the database natively provides for this. Unfortunately that's not the way it was designed. Due to legacy integration needs, all these advanced DB types are prohibited within our application. So we have to store the comment across several rows, each with a single comment field of at most 500 characters. That means the long comment has to be split into several sub-strings of 500 characters, each of them stored in a separate row of the DB table. The table has a counter as part of the primary key, which is incremented for each new row belonging to the same comment. This way we can easily find every row that is part of the same comment and put the pieces back in order.
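To make the storage scheme concrete, here is a minimal sketch of it. The names CommentRows, Row, toRows and reassemble are made up for illustration, the counter starting at 1 is an assumption, and the split simply counts Java chars, so it ignores the byte-length issue entirely.

```java
import java.util.ArrayList;
import java.util.List;

final class CommentRows {

    /** One row of the (hypothetical) comment table: commentId + counter form the key. */
    static final class Row {
        final long commentId;
        final int counter;
        final String text;

        Row(long commentId, int counter, String text) {
            this.commentId = commentId;
            this.counter = counter;
            this.text = text;
        }
    }

    /** Cuts the comment into pieces of at most maxChars chars, numbering them from 1 (assumed). */
    static List<Row> toRows(long commentId, String comment, int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        List<Row> rows = new ArrayList<Row>();
        int counter = 1;
        for (int start = 0; start < comment.length(); start += maxChars) {
            int end = Math.min(start + maxChars, comment.length());
            rows.add(new Row(commentId, counter++, comment.substring(start, end)));
        }
        return rows;
    }

    /** Rebuilds the comment from its rows, which must already be ordered by counter. */
    static String reassemble(List<Row> rowsOrderedByCounter) {
        StringBuilder sb = new StringBuilder();
        for (Row row : rowsOrderedByCounter) {
            sb.append(row.text);
        }
        return sb.toString();
    }
}
```

A comment of 1201 characters would end up in three rows of 500, 500 and 201 characters, with counters 1, 2 and 3.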

Now another problem we have is that under DB2 a field defined as VARCHAR(500) can contain at most 500 bytes, not 500 characters, and the strings are encoded in UTF-8 in the database. That means we might not be able to store 500 characters if the string contains one or more multi-byte UTF-8 characters, such as 2-byte accented letters. Working in a French environment, this happens a lot.
So we had to write a little algorithm that splits the string into sub-strings of at most 500 bytes.
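The gap between character count and byte count is easy to check. The following is only an illustrative sketch; Utf8Length and utf8Length are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Length {

    /** Number of bytes the string occupies once encoded as UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String plain = "ete";                 // no accents
        String accented = "\u00e9t\u00e9";    // "été", each é takes 2 bytes in UTF-8

        // prints "3 chars, 3 bytes"
        System.out.println(plain.length() + " chars, " + utf8Length(plain) + " bytes");
        // prints "3 chars, 5 bytes"
        System.out.println(accented.length() + " chars, " + utf8Length(accented) + " bytes");
    }
}
```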

The very first version of our algorithm split the string in a quite naive way: we converted the string to a byte array using UTF-8 encoding and split the byte array instead of the string. Each 500-byte array was then converted back to a string before being inserted in the database.
Happily, we figured out quite soon that this doesn't work, as it quite often ends up splitting the string right in the middle of a 2-byte character. Once the byte arrays were converted back to strings, the split 2-byte character was corrupted and could no longer be recovered.
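The corruption can be reproduced in a few lines. This is a reconstruction of the idea, not our original code; NaiveSplitDemo and naiveSplit are hypothetical names and the cut position is passed in explicitly to keep the example small.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class NaiveSplitDemo {

    /** Naive approach: cut the UTF-8 bytes at a fixed offset and decode each part on its own. */
    static String[] naiveSplit(String value, int cut) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        String head = new String(Arrays.copyOfRange(bytes, 0, cut), StandardCharsets.UTF_8);
        String tail = new String(Arrays.copyOfRange(bytes, cut, bytes.length), StandardCharsets.UTF_8);
        return new String[] { head, tail };
    }

    public static void main(String[] args) {
        // "café" is 5 bytes in UTF-8: c, a, f, then 2 bytes for é.
        // Cutting after 4 bytes lands in the middle of the é.
        String[] parts = naiveSplit("caf\u00e9", 4);
        // Both halves now contain the replacement character U+FFFD instead of é.
        System.out.println((parts[0] + parts[1]).equals("caf\u00e9")); // prints "false"
    }
}
```

The String constructor silently replaces each broken half of the character with U+FFFD, which is why the original text cannot be rebuilt by concatenating the pieces.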

Before writing a smarter version of the algorithm, which would manually test the byte length of the character right at the position of the split, we took a step back and wondered: "Doesn't Java natively offer a simple way to do just that?"

And the answer is yes, of course.

The solution with Java is quite straightforward:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.util.ArrayList;
import java.util.List;

public class StringSplitter {

    public static List<String> splitString500(String value) throws Exception {
        List<String> retList = new ArrayList<String>();
        try {

            Charset utf8CSet = Charset.forName("UTF-8");
            CharsetEncoder enc = utf8CSet.newEncoder();
            CharsetDecoder decoder = utf8CSet.newDecoder();

            CharBuffer cBuffer = CharBuffer.wrap(value);
            ByteBuffer bBuffer = ByteBuffer.allocate(500);
            CoderResult cr = null;
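            do {
                // Encode as many whole characters as fit in the 500-byte buffer.
                cr = enc.encode(cBuffer, bBuffer, false);
                // Report unencodable input (e.g. an unpaired surrogate) instead of looping forever.
                if (cr.isError()) {
                    cr.throwException();
                }
                retList.add(decoder.decode((ByteBuffer) bBuffer.flip()).toString());
                // clear() resets both position and limit; rewind() alone would keep
                // the shorter limit left by flip() and shrink all following chunks.
                bBuffer.clear();
            } while (cr != CoderResult.UNDERFLOW);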

        } catch (CharacterCodingException e) {
            e.printStackTrace();
            throw new Exception(e);
        }
        return retList;
    }
}
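The same encoder-driven loop can be sketched with a configurable limit, which makes it easier to try out on short strings. This is only a sketch under a few assumptions of mine: the names Utf8Chunker, split and maxBytes are hypothetical, the minimum of 4 bytes is there because a single code point can need up to 4 bytes in UTF-8, and encoding errors are wrapped in an IllegalArgumentException purely for convenience.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class Utf8Chunker {

    /** Splits value into pieces whose UTF-8 encoding is at most maxBytes bytes each. */
    static List<String> split(String value, int maxBytes) {
        if (maxBytes < 4) {
            // A smaller buffer could never hold a 4-byte character, so the loop would not progress.
            throw new IllegalArgumentException("maxBytes must be at least 4");
        }
        List<String> chunks = new ArrayList<String>();
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        CharBuffer in = CharBuffer.wrap(value);
        ByteBuffer out = ByteBuffer.allocate(maxBytes);
        try {
            CoderResult result;
            do {
                // The encoder only writes whole characters; it stops with OVERFLOW
                // when the next character does not fit in the remaining space.
                result = encoder.encode(in, out, true);
                if (result.isError()) {
                    result.throwException();
                }
                out.flip();
                chunks.add(StandardCharsets.UTF_8.decode(out).toString());
                out.clear(); // reset position and limit for the next chunk
            } while (result.isOverflow());
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("value cannot be encoded as UTF-8", e);
        }
        return chunks;
    }
}
```

With a limit of 10 bytes, the string "ABCकखगघङD" (3 single-byte letters, 5 three-byte letters, 1 single-byte letter) comes out as two pieces, "ABCकख" (9 bytes) and "गघङD" (10 bytes).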

This piece of code is interesting as it underlines two things:

  1. Java goes a long way to make string manipulation as simple as possible. The JDK ships a whole set of classes, java.nio.charset among them, aimed at taming the overwhelming complexity of string representation, encoding, decoding, etc. Once again Java rocks.
  2. Despite working with Java technologies every day for almost 8 years, I still discover new classes and new stuff every day, even in the standard Java libraries, not to mention the dozens of Java-related technologies and libraries that keep flourishing. This is a good example, as I was really about to write this code on my own before finding out that the JDK already provides such a feature.

Well, that was a good lesson in humility. Hackers tend to limit themselves to what they already know of a technology and assume that everything else doesn't exist and has to be written from scratch, while the very feature they are looking for might well be implemented already. One simply needs to keep one's eyes open.



Comments:

ya java rocks every time it gives out solution to any problem

Posted by neo on November 09, 2010 at 11:56 AM CET #


Really cool article on java java really rocks

Posted by anehra63 on November 10, 2010 at 10:30 AM CET #


It's you, dude rocks!

Posted by Nathanael Yang on January 14, 2019 at 08:31 PM CET #


When I test with "ABCकखगघङD", the code as posted splits it into 3 tokens instead of 2. Output: [ABCकख, गघङ, D]. But the last character, D, is a single-byte character and would fit into the second token.

Posted by Venkat on June 05, 2021 at 02:13 PM CEST #


I forgot to mention that I used a buffer size of 10 instead of 500.

Posted by Venkat on June 05, 2021 at 02:18 PM CEST #


Figured out the issue: the limit needs to be reset after the rewind() call, as below. Otherwise, once a chunk stops short of the buffer size (say the read limit is 9), that same shorter limit is reused for all following chunks.
bBuffer.limit(byteLimit);

Posted by Venkat on June 05, 2021 at 05:05 PM CEST #

