VARCHAR(500) can hold at most 500 bytes, even though the strings are stored in the database as UTF-8. That means we might not be able to store 500 characters if the string contains one or more two-byte UTF-8 characters. Since we work in a French-speaking environment, this happens a lot. So we had to write a little algorithm that splits the string into sub-strings of at most 500 bytes. The very first version of our algorithm was quite naive: we converted the string to a byte array using the UTF-8 encoding and split the byte array instead of the string. Each 500-byte array was then converted back to a string before being inserted into the database.
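To make that first version concrete, here is a minimal sketch of it. The class name `NaiveSplitter` and the `maxBytes` parameter are assumptions made for illustration (our code used a fixed 500); the small `main` uses a tiny chunk size so the effect on a two-byte character is easy to see.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical reconstruction of the naive approach: cut the UTF-8 bytes
// every maxBytes bytes, then decode each slice on its own.
class NaiveSplitter {

    static List<String> split(String value, int maxBytes) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        List<String> chunks = new ArrayList<>();
        for (int start = 0; start < bytes.length; start += maxBytes) {
            int end = Math.min(start + maxBytes, bytes.length);
            // A slice may begin or end in the middle of a multi-byte character.
            byte[] slice = Arrays.copyOfRange(bytes, start, end);
            chunks.add(new String(slice, StandardCharsets.UTF_8));
        }
        return chunks;
    }

    public static void main(String[] args) {
        // "\u00e9" (e with acute accent) takes two bytes in UTF-8,
        // so "a\u00e9" is three bytes long.
        String original = "a\u00e9";
        List<String> chunks = split(original, 2);
        // The cut falls between the two bytes of the accented character.
        System.out.println(String.join("", chunks).equals(original)); // prints false
    }
}
```

With a chunk size of 2, the second byte of the accented character lands in a different slice than the first, and neither slice decodes back to the original character.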
Fortunately, we soon realized that this doesn't work: it quite often splits the string right in the middle of a two-byte character. When the byte arrays are converted back to strings, the split character is corrupted and cannot be repaired afterwards. Before writing a smarter version of the algorithm that would manually test the byte length of the character at the split position, we took a step back and wondered: "Can it be that Java doesn't natively offer a simple way to do just that?" Of course it does. The solution in Java is quite straightforward:
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.util.ArrayList;
import java.util.List;
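public class StringSplitter {

    // Maximum size of one chunk, in bytes (the VARCHAR(500) limit).
    private static final int MAX_BYTES = 500;

    public static List<String> splitString500(String value) throws CharacterCodingException {
        List<String> retList = new ArrayList<String>();
        Charset utf8CSet = Charset.forName("UTF-8");
        CharsetEncoder enc = utf8CSet.newEncoder();
        CharsetDecoder decoder = utf8CSet.newDecoder();
        CharBuffer cBuffer = CharBuffer.wrap(value);
        ByteBuffer bBuffer = ByteBuffer.allocate(MAX_BYTES);
        CoderResult cr;
        do {
            // Encode as many whole characters as fit into the byte buffer.
            cr = enc.encode(cBuffer, bBuffer, true);
            if (cr.isError()) {
                // Malformed input (e.g. an unpaired surrogate) would never be
                // consumed: fail here instead of looping forever.
                cr.throwException();
            }
            bBuffer.flip();
            retList.add(decoder.decode(bBuffer).toString());
            // clear() resets position and limit; rewind() would keep the limit
            // left by flip() and shrink the room available to later chunks.
            bBuffer.clear();
        } while (cr.isOverflow());
        return retList;
    }
}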
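The loop above relies on one behaviour of `CharsetEncoder.encode`: when the next character does not fit entirely into the output buffer, the encoder stops and reports an overflow rather than writing half of the character. The following sketch isolates that behaviour as I understand it for the JDK's UTF-8 encoder; the class and method names are hypothetical and only serve the demonstration.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows that the encoder stops before a character
// that does not fit entirely into the output buffer.
class EncoderOverflowDemo {

    // Encodes value into a buffer of the given capacity and returns
    // how many bytes a single encode() call wrote.
    static int bytesWrittenByFirstPass(String value, int capacity) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        CharBuffer in = CharBuffer.wrap(value);
        ByteBuffer out = ByteBuffer.allocate(capacity);
        // If the buffer fills up, the result is an overflow and the characters
        // not yet encoded stay in the CharBuffer for the next pass.
        encoder.encode(in, out, true);
        return out.position();
    }

    public static void main(String[] args) {
        // "a" is 1 byte, "\u00e9" is 2 bytes; a 2-byte buffer has room for
        // "a" but not for the whole accented character.
        System.out.println(bytesWrittenByFirstPass("a\u00e9", 2)); // prints 1
    }
}
```

This is why each chunk can be decoded on its own without corruption: a chunk may be a byte or so shorter than the limit, but it always ends on a character boundary.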
This piece of code is interesting because it highlights two things:
- Java definitely tries to make string manipulation as simple as possible. The JDK ships plenty of classes aimed at taming the overwhelming complexity of string representation, encoding, decoding, etc. Once again, Java rocks.
- Despite working with Java technologies every day for almost 8 years, I still discover new classes and features all the time, even in the standard Java libraries, not to mention the dozens of Java-related technologies and libraries that keep appearing. This is a good example: I was really about to write this code on my own before finding out that the JDK already provides such a feature.
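For comparison, a hand-written version that checks the byte length of each character before adding it to the current chunk could look roughly like the sketch below. Everything in it (the class name `ManualUtf8Splitter`, the `utf8Length` helper, the `maxBytes` parameter) is an assumption for illustration, not code we actually shipped.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-written alternative: count UTF-8 bytes per code point
// and start a new chunk before the limit would be exceeded.
class ManualUtf8Splitter {

    // Number of bytes UTF-8 needs for one Unicode code point.
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3;
        return 4;
    }

    static List<String> split(String value, int maxBytes) {
        if (maxBytes < 4) {
            // A single code point can need up to 4 bytes in UTF-8.
            throw new IllegalArgumentException("maxBytes must be at least 4");
        }
        List<String> chunks = new ArrayList<>();
        int chunkStart = 0; // char index where the current chunk begins
        int chunkBytes = 0; // UTF-8 size of the current chunk so far
        int i = 0;
        while (i < value.length()) {
            int codePoint = value.codePointAt(i);
            int length = utf8Length(codePoint);
            if (chunkBytes + length > maxBytes) {
                // The next character would not fit: close the chunk before it.
                chunks.add(value.substring(chunkStart, i));
                chunkStart = i;
                chunkBytes = 0;
            }
            chunkBytes += length;
            // Surrogate pairs occupy two chars but form a single code point.
            i += Character.charCount(codePoint);
        }
        chunks.add(value.substring(chunkStart));
        return chunks;
    }
}
```

It is not much longer than the JDK-based version, but it means maintaining the UTF-8 length rules by hand, which is exactly the kind of detail the `java.nio.charset` classes already take care of.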