VARCHAR(500) can contain 500 bytes max even though the strings are encoded in UTF-8 in the database. That means we might not be able to store 500 characters if the string contains one or more 2 bytes UTF-8 characters. Working in a french environment, this happens a lot. So we had to write a little algorithm taking care of the splitting of the string in 500 bytes sub-strings. The very first version of our algorithm was quite stupid and ended up in splitting the string in a quite naive way: we converted the string to a byte array following an UTF-8 encoding and split the byte array instead of the string. Then each of the 500 bytes arrays was converted back to a string before being inserted in the database.
Happily, we figured out quite soon that this doesn't work as it ends up quite often splitting the string right in the middle of a 2 bytes character. The byte arrays being then converted back to strings, the split 2 bytes character was corrupted and could not be corrected any more. Before writing as smarter version of the algorithm which would manually test the byte length of the character right at the position of the split, we took a leap backward and wondered : "Can it be that Java doesn't offer natively a simple way to do just that ?" And the answer is yes of course. [Read More]