
Thread: Any known issues with 4-byte utf-8 characters and JAX-WS?


Replies: 3 - Last Post: Jul 30, 2007 3:52 PM by: justinlindh
chriscorbell

Posts: 1
Any known issues with 4-byte utf-8 characters and JAX-WS?
Posted: Jul 19, 2007 2:26 PM
 

I have a webservice hosted in JBoss and recently upgraded to JAX-WS from JAX-RPC.

Everything is working well, except for a bug that did not occur under JAX-RPC: when the SOAP message body contains a character that takes 4 bytes in UTF-8 (e.g. some Japanese characters), the server returns a "Bad request" fault somewhere early in the stack.

Is there any known issue with 4-byte utf-8 characters and JAX-WS? The byte sequence of the character I'm using to test is F0 A6 9F 8C.

The character (assuming it renders correctly here) is 𦟌. It occurs in the text content of an element in the SOAP body (not in an attribute value or identifier).

TIA,
Chris

joconner

Posts: 4
Re: Any known issues with 4-byte utf-8 characters and JAX-WS?
Posted: Jul 20, 2007 12:07 PM   in response to: chriscorbell
 

Most Japanese characters encode in 3 bytes in UTF-8. For example, the character 漢 (KAN) encodes as the three UTF-8 code units E6 BC A2. If you have Japanese characters that encode as four UTF-8 code units, you must be using characters above the Basic Multilingual Plane (supplementary characters). Maybe you are encoding the characters incorrectly? Are you really using supplementary characters?
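
One quick way to check the byte counts from Java is to encode the characters and look at the array length. This is only a minimal sketch: it assumes Java 7 or later for `StandardCharsets` (on older versions, `getBytes("UTF-8")` does the same job), and the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of JAX-WS or any library.
class Utf8Lengths {
    // Returns how many bytes the text occupies once encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // U+6F22 (KAN) lies inside the Basic Multilingual Plane.
        String kan = "\u6F22";
        // U+267CC lies outside it; build it from the code point to avoid
        // typing the surrogate pair by hand.
        String supplementary = new String(Character.toChars(0x267CC));

        System.out.println(utf8Length(kan));           // 3
        System.out.println(utf8Length(supplementary)); // 4
    }
}
```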

Regards,
John O'Conner

joconner

Posts: 4
Re: Any known issues with 4-byte utf-8 characters and JAX-WS?
Posted: Jul 20, 2007 7:58 PM   in response to: chriscorbell
 

Converting your UTF-8 to a Unicode code point value, I get U+267CC, definitely in the supplementary area. It is a completely valid Unicode character, supported nicely in Java SE 5 and higher. What version of Java are you using? Maybe you are using an older version of the Java platform, one that doesn't quite grok the character. Can you try a slightly less ambitious character, perhaps one at or below U+FFFF? Let's see how your app works then, and then we'll re-evaluate the problem.
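
For anyone who wants to reproduce that conversion, here is a rough sketch that decodes the four bytes from the original post and prints the resulting code point. It assumes Java 7 or later for `StandardCharsets`; the class and method names are invented for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only.
class DecodeSupplementary {
    // Decodes UTF-8 bytes and returns the first code point of the text.
    static int firstCodePoint(byte[] utf8) {
        String decoded = new String(utf8, StandardCharsets.UTF_8);
        return decoded.codePointAt(0);
    }

    public static void main(String[] args) {
        // The byte sequence quoted in the original question.
        byte[] bytes = {(byte) 0xF0, (byte) 0xA6, (byte) 0x9F, (byte) 0x8C};
        int codePoint = firstCodePoint(bytes);

        System.out.println(String.format("U+%04X", codePoint));            // U+267CC
        System.out.println(Character.isSupplementaryCodePoint(codePoint)); // true
        // A supplementary code point occupies two char values (a surrogate pair).
        System.out.println(Character.charCount(codePoint));                // 2
    }
}
```

Note that `String.length()` reports 2 for this single character, so code that walks the text `char` by `char` sees two surrogates rather than one character.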

Regards,
John O'Conner
http://joconner.com

justinlindh

Posts: 4
Re: Any known issues with 4-byte utf-8 characters and JAX-WS?
Posted: Jul 30, 2007 3:52 PM   in response to: joconner
 

I'm also having some problems that sound similar to this.

I'm submitting data from a web page, and when I use Japanese characters I'm seeing the following received in the debugger:
\u0006F22\u0005B57 (for: 漢字)

This is UTF-16, but I'm sending this data to a .NET application that expects UTF-8. How can I do this conversion? I've tried:
String utf8Body = new String(request.getBody().getBytes(), "UTF-8");

But this only serves to mangle the String once received. I'm new to internationalization issues, so any help is greatly appreciated.
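
For what it's worth, a Java `String` has no encoding of its own; an encoding only comes into play when the text is turned into bytes. `getBytes()` with no argument uses the platform default charset, and `new String(bytes, "UTF-8")` decodes bytes rather than producing UTF-8, which may explain the mangling. A minimal sketch of the usual approach is below. It assumes Java 7 or later for `StandardCharsets`, the class and method names are hypothetical, and how the bytes are actually sent to the receiving application is not shown because the post does not say.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only.
class Utf8Body {
    // Encodes text as UTF-8 bytes; these bytes are what should go on the wire.
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Formats bytes as space-separated upper-case hex for inspection.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String body = "\u6F22\u5B57"; // the two characters from the post
        byte[] wire = toUtf8(body);
        System.out.println(toHex(wire)); // E6 BC A2 E5 AD 97

        // Decoding with the matching charset restores the original text.
        System.out.println(new String(wire, StandardCharsets.UTF_8).equals(body)); // true
        // Decoding with a different charset yields different (mangled) text.
        System.out.println(new String(wire, StandardCharsets.ISO_8859_1).equals(body)); // false
    }
}
```

The receiving side also has to be told the bytes are UTF-8, for example through a `charset=UTF-8` parameter on the HTTP `Content-Type` header if the data travels over HTTP.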



