Wednesday, March 2, 2011

Java String Encoding

I think introductory Java courses should discuss coding for an international community - specifically encoding/charset, locale (language/country), and timezone - since creating Strings, reading and writing Files, and creating Dates are fundamental.
And it has become increasingly apparent to me how common it is to assume that the default OS/system values are fine.

When working on the Maven Java formatter plugin, I began considering system-dependent properties of Java source files, such as line endings and then encoding.
I found an encoding convention for Maven plugins to follow: read and write the source files using a defined encoding rather than the platform default.

When I tried to build the OpenMRS project, I found the build was failing, apparently only on my local (Windows) machine.
A test source file had recently been added with UTF-8 encoding, including "foreign text".
The first failure came from the compiler plugin, whose version needed to be updated to support the encoding property.
The second failure was an assertion in this test. The "foreign text" was being handled using the default charset of the platform, via String(byte[]) and String.getBytes(). This is a case where the CI server (Linux) didn't help to keep the build stable: the test passed there, presumably because its default charset could represent the text, and failed on Windows. I ended up submitting a patch to use the String constructor and methods that take an encoding.
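
To make the difference concrete, here is a minimal sketch of the kind of change involved. It is not the actual OpenMRS patch: the class and method names are made up for illustration, and StandardCharsets assumes Java 7 or later (on Java 6, Charset.forName("UTF-8") does the same job).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical example class, not taken from the OpenMRS code base.
class ExplicitCharsetExample {

    // Encodes text with a named charset instead of the platform default.
    static byte[] encodeUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Decodes bytes with the same named charset.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "cafe" with an acute accent on the e, written as an escape so the
        // encoding of this source file does not matter.
        String text = "caf\u00e9";

        // Platform dependent: both calls rely on Charset.defaultCharset(),
        // so the result can differ between machines.
        String viaDefault = new String(text.getBytes());

        // Platform independent: the charset is stated explicitly.
        String viaUtf8 = decodeUtf8(encodeUtf8(text));

        System.out.println("Default charset: " + Charset.defaultCharset());
        System.out.println("Round trip via default matches: " + text.equals(viaDefault));
        System.out.println("Round trip via UTF-8 matches: " + text.equals(viaUtf8));
    }
}
```

The UTF-8 round trip always matches. The round trip through the default charset only matches when that charset happens to be able to represent the text, which is exactly the sort of thing that differs between a Linux CI server and a Windows desktop.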

The Java Internationalization FAQ lists examples where the UTF-8 encoding cannot be used, including:
"When writing plain text files that will be interpreted in the host OS's default encoding, if that encoding is not UTF-8."
This is exactly the case I was seeing. The UTF-8 file and its text were being read and compared using the default encoding for Windows, Cp1252.
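
A small sketch of that mismatch, assuming the JRE provides the windows-1252 charset (the canonical name for Cp1252, which mainstream JDKs ship but the Java spec does not guarantee). The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical example class showing UTF-8 bytes misread as Cp1252.
class MojibakeExample {

    // Encodes the text as UTF-8, then decodes those bytes as windows-1252,
    // which is what happens when a UTF-8 file is read with the Windows default.
    static String readUtf8AsCp1252(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        String original = "\u00e9"; // e with an acute accent, two bytes in UTF-8
        String misread = readUtf8AsCp1252(original);

        System.out.println(original.equals(misread)); // prints "false"
        System.out.println(misread.length());         // prints "2"
    }
}
```

The single accented character becomes two unrelated characters (U+00C3 and U+00A9), so any assertion comparing the text against the expected String fails.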

This is just one recent example, but there are numerous others, and I think CS students should learn about these types of problems early on.
