Class Utf8
There are several variants of UTF-8. The one implemented by this class is the restricted definition of UTF-8 introduced in Unicode 3.1, which mandates the rejection of "overlong" byte sequences as well as rejection of 3-byte surrogate codepoint byte sequences. Note that the UTF-8 decoder included in Oracle's JDK has been modified to also reject "overlong" byte sequences, but (as of 2011) still accepts 3-byte surrogate codepoint byte sequences.
The byte sequences considered valid by this class are exactly those that can be round-trip converted to a String and back to bytes using the UTF-8 charset, without loss:
Arrays.equals(bytes, new String(bytes, Internal.UTF_8).getBytes(Internal.UTF_8))
See the Unicode Standard, Table 3-6 (UTF-8 Bit Distribution) and Table 3-7 (Well-Formed UTF-8 Byte Sequences).
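The round-trip criterion above can be tried out with the JDK alone. The following is a minimal sketch, assuming that Internal.UTF_8 denotes the same charset as StandardCharsets.UTF_8; the class name Utf8RoundTrip and the method roundTrips are hypothetical and are not part of this class.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8RoundTrip {
  // Assumption: Internal.UTF_8 is the same charset as StandardCharsets.UTF_8.
  static boolean roundTrips(byte[] bytes) {
    // Malformed input is replaced during decoding, so re-encoding it
    // yields different bytes and the comparison fails.
    String decoded = new String(bytes, StandardCharsets.UTF_8);
    return Arrays.equals(bytes, decoded.getBytes(StandardCharsets.UTF_8));
  }

  public static void main(String[] args) {
    byte[] valid = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
    // Overlong 2-byte encoding of U+0000.
    byte[] overlong = {(byte) 0xC0, (byte) 0x80};
    // 3-byte encoding of the surrogate code point U+D800.
    byte[] surrogate = {(byte) 0xED, (byte) 0xA0, (byte) 0x80};
    System.out.println(roundTrips(valid));
    System.out.println(roundTrips(overlong));
    System.out.println(roundTrips(surrogate));
  }
}
```

Only the first input is expected to survive the round trip; the overlong and surrogate sequences are the two cases the restricted definition rejects.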
-
Nested Class Summary
Nested Classes
Modifier and Type, Class, Description
private static class: Utility methods for decoding bytes into String.
(package private) static class: A processor of UTF-8 strings, providing methods for checking validity and encoding.
(package private) static final class: Utf8.Processor implementation that does not use any sun.misc.Unsafe methods.
private static class
(package private) static final class: Utf8.Processor that uses sun.misc.Unsafe where possible to improve performance.
-
Field Summary
Fields
Modifier and Type, Field, Description
private static final long ASCII_MASK_LONG: A mask used when performing unsafe reads to determine if a long value contains any non-ASCII characters (i.e. any byte >= 0x80).
(package private) static final int MAX_BYTES_PER_CHAR: Maximum number of bytes per Java UTF-16 char in UTF-8.
private static final Utf8.Processor processor: UTF-8 is a runtime hot spot so we attempt to provide heavily optimized implementations depending on what is available on the platform.
private static final int UNSAFE_COUNT_ASCII_THRESHOLD: Used by Unsafe UTF-8 string validation logic to determine the minimum string length above which to employ an optimized algorithm for counting ASCII characters.
-
Constructor Summary
Constructors -
Method Summary
Modifier and Type, Method, Description
(package private) static String decodeUtf8(byte[] bytes, int index, int size): Decodes the given UTF-8 encoded byte array slice into a String.
(package private) static String decodeUtf8(ByteBuffer buffer, int index, int size): Decodes the given UTF-8 portion of the ByteBuffer into a String.
(package private) static int
(package private) static int encodedLength(String string): Returns the number of bytes in the UTF-8-encoded form of string.
private static int encodedLengthGeneral(String string, int start)
(package private) static void encodeUtf8(String in, ByteBuffer out): Encodes the given characters to the target ByteBuffer using UTF-8 encoding.
private static int estimateConsecutiveAscii(ByteBuffer buffer, int index, int limit): Counts (approximately) the number of consecutive ASCII characters in the given buffer.
(package private) static boolean isValidUtf8(byte[] bytes): Returns true if the given byte array is a well-formed UTF-8 byte sequence.
(package private) static boolean isValidUtf8(byte[] bytes, int index, int limit): Returns true if the given byte array slice is a well-formed UTF-8 byte sequence.
(package private) static boolean isValidUtf8(ByteBuffer buffer): Determines if the given ByteBuffer is a valid UTF-8 string.
-
Field Details
-
processor
private static final Utf8.Processor processor
UTF-8 is a runtime hot spot, so we attempt to provide heavily optimized implementations depending on what is available on the platform. The processor is the platform-optimized delegate to which all methods delegate directly.
-
ASCII_MASK_LONG
private static final long ASCII_MASK_LONG
A mask used when performing unsafe reads to determine if a long value contains any non-ASCII characters (i.e. any byte >= 0x80).
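As an illustration of how such a mask works: a byte is >= 0x80 exactly when its high bit is set, so a mask with the high bit of each of the eight bytes tests a whole long at once. The sketch below assumes the mask value 0x8080808080808080L, which this page does not show; the class AsciiMask and the method hasNonAscii are hypothetical names.

```java
final class AsciiMask {
  // Assumed value: the high bit of each of the eight bytes in a long.
  static final long ASCII_MASK_LONG = 0x8080808080808080L;

  // Returns true if any of the eight bytes packed into word is >= 0x80.
  static boolean hasNonAscii(long word) {
    return (word & ASCII_MASK_LONG) != 0;
  }

  public static void main(String[] args) {
    long ascii = 0x4142434445464748L; // the bytes of "ABCDEFGH"
    long mixed = 0x41424344C3A94748L; // contains 0xC3 0xA9 (UTF-8 for U+00E9)
    System.out.println(hasNonAscii(ascii));
    System.out.println(hasNonAscii(mixed));
  }
}
```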
-
MAX_BYTES_PER_CHAR
static final int MAX_BYTES_PER_CHAR
Maximum number of bytes per Java UTF-16 char in UTF-8.
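A bound of this kind is typically used to size a worst-case output buffer. The sketch below assumes the value 3, which this page does not show: a char in the Basic Multilingual Plane needs at most 3 bytes, and a supplementary code point occupies 2 chars but only 4 bytes, so 3 bytes per char is never exceeded. The class MaxBytesPerChar and the method maxEncodedSize are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

final class MaxBytesPerChar {
  // Assumed value; the constant's value is not shown in this documentation.
  static final int MAX_BYTES_PER_CHAR = 3;

  // Upper bound on the UTF-8 size of s (ignores int overflow for huge strings).
  static int maxEncodedSize(String s) {
    return s.length() * MAX_BYTES_PER_CHAR;
  }

  public static void main(String[] args) {
    String euro = "\u20AC";          // 1 char, 3 bytes in UTF-8
    String emoji = "\uD83D\uDE00";   // 2 chars (surrogate pair), 4 bytes in UTF-8
    System.out.println(euro.getBytes(StandardCharsets.UTF_8).length + " <= " + maxEncodedSize(euro));
    System.out.println(emoji.getBytes(StandardCharsets.UTF_8).length + " <= " + maxEncodedSize(emoji));
  }
}
```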
-
UNSAFE_COUNT_ASCII_THRESHOLD
private static final int UNSAFE_COUNT_ASCII_THRESHOLD
Used by Unsafe UTF-8 string validation logic to determine the minimum string length above which to employ an optimized algorithm for counting ASCII characters. For small strings, the optimization may not be beneficial, or may even hurt performance, because it requires additional logic to avoid unaligned reads (when calling Unsafe.getLong). The threshold ensures that, even if the initial offset is unaligned, at least one call to Unsafe.getLong() is made, which provides a performance improvement that entirely subsumes the cost of the additional logic.
-
-
Constructor Details
-
Utf8
private Utf8()
-
-
Method Details
-
isValidUtf8
static boolean isValidUtf8(byte[] bytes)
Returns true if the given byte array is a well-formed UTF-8 byte sequence.
This is a convenience method, equivalent to a call to isValidUtf8(bytes, 0, bytes.length).
-
isValidUtf8
static boolean isValidUtf8(byte[] bytes, int index, int limit)
Returns true if the given byte array slice is a well-formed UTF-8 byte sequence. The range of bytes to be checked extends from index index, inclusive, to limit, exclusive.
-
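The index/limit convention of the slice variant can be illustrated with the JDK's strict decoder. This is a sketch, not the algorithm used by this class: it relies on a CharsetDecoder configured to report errors, whose strictness may differ from this class depending on the JDK version. The class SliceValidation and the method isValidUtf8Slice are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class SliceValidation {
  // Checks bytes[index, limit): index is inclusive, limit is exclusive.
  static boolean isValidUtf8Slice(byte[] bytes, int index, int limit) {
    try {
      StandardCharsets.UTF_8.newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT)
          // ByteBuffer.wrap takes (offset, length), so convert limit to a length.
          .decode(ByteBuffer.wrap(bytes, index, limit - index));
      return true;
    } catch (CharacterCodingException e) {
      return false;
    }
  }
}
```

Because the limit is exclusive, a slice that ends in the middle of a multi-byte sequence is invalid even when the full array is valid, and a slice that starts on a continuation byte is invalid as well.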
isValidUtf8
static boolean isValidUtf8(ByteBuffer buffer)
Determines if the given ByteBuffer is a valid UTF-8 string.
Selects an optimal algorithm based on the type of ByteBuffer (i.e. heap or direct) and the capabilities of the platform.
Parameters:
buffer - the buffer to check.
-
encodedLength
static int encodedLength(String string)
Returns the number of bytes in the UTF-8-encoded form of string. For a string, this method is equivalent to string.getBytes(UTF_8).length, but is more efficient in both time and space.
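The length can be computed without allocating the encoded bytes by classifying each UTF-16 char. The following is a sketch of that idea, not the implementation in this class: the class EncodedLengthSketch is a hypothetical name, IllegalArgumentException stands in for whatever exception the real method throws on unpaired surrogates, and int overflow for very long strings is not handled.

```java
import java.nio.charset.StandardCharsets;

final class EncodedLengthSketch {
  static int encodedLength(String s) {
    int bytes = 0;
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (c < 0x80) {
        bytes += 1; // ASCII
      } else if (c < 0x800) {
        bytes += 2;
      } else if (Character.isSurrogate(c)) {
        // A valid high/low pair encodes one supplementary code point in 4 bytes.
        if (!Character.isHighSurrogate(c)
            || i + 1 >= s.length()
            || !Character.isLowSurrogate(s.charAt(i + 1))) {
          throw new IllegalArgumentException("Unpaired surrogate at index " + i);
        }
        bytes += 4;
        i++; // skip the low surrogate
      } else {
        bytes += 3;
      }
    }
    return bytes;
  }

  public static void main(String[] args) {
    String s = "a\u00e9\u20AC\uD83D\uDE00";
    System.out.println(encodedLength(s) == s.getBytes(StandardCharsets.UTF_8).length);
  }
}
```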
encodedLengthGeneral
private static int encodedLengthGeneral(String string, int start) throws Utf8.UnpairedSurrogateException
Throws:
Utf8.UnpairedSurrogateException
-
encode
-
decodeUtf8
static String decodeUtf8(ByteBuffer buffer, int index, int size) throws InvalidProtocolBufferException
Decodes the given UTF-8 portion of the ByteBuffer into a String.
Throws:
InvalidProtocolBufferException - if the input is not valid UTF-8.
-
decodeUtf8
static String decodeUtf8(byte[] bytes, int index, int size) throws InvalidProtocolBufferException
Decodes the given UTF-8 encoded byte array slice into a String.
Throws:
InvalidProtocolBufferException - if the input is not valid UTF-8.
-
encodeUtf8
static void encodeUtf8(String in, ByteBuffer out)
Encodes the given characters to the target ByteBuffer using UTF-8 encoding.
Selects an optimal algorithm based on the type of ByteBuffer (i.e. heap or direct) and the capabilities of the platform.
Parameters:
in - the source string to be encoded
out - the target buffer to receive the encoded string.
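The contract of encoding a string into a caller-supplied buffer, whether heap or direct, can be sketched with the JDK's CharsetEncoder. This is not the optimized implementation described above: the class EncodeSketch is a hypothetical name, and the exceptions thrown on overflow and on malformed input are assumptions made for the example.

```java
import java.nio.BufferOverflowException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class EncodeSketch {
  // Writes the UTF-8 form of in at out's current position and advances it.
  static void encodeUtf8(String in, ByteBuffer out) {
    CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);
    CoderResult result = encoder.encode(CharBuffer.wrap(in), out, true);
    if (result.isUnderflow()) {
      result = encoder.flush(out); // all input consumed; finish the operation
    }
    if (result.isOverflow()) {
      throw new BufferOverflowException(); // assumed behavior for a full buffer
    }
    if (result.isError()) {
      // For example, an unpaired surrogate in the input string.
      throw new IllegalArgumentException("Input is not well-formed UTF-16: " + result);
    }
  }
}
```

The same call works for ByteBuffer.allocate and ByteBuffer.allocateDirect targets; after it returns, the buffer's position has advanced by the number of bytes written.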
-
estimateConsecutiveAscii
private static int estimateConsecutiveAscii(ByteBuffer buffer, int index, int limit)
Counts (approximately) the number of consecutive ASCII characters in the given buffer. The byte order of the ByteBuffer does not matter, so performance can be improved if native byte order is used (i.e. no byte-swapping in ByteBuffer.getLong(int)).
Parameters:
buffer - the buffer to be scanned for ASCII chars
index - the starting index of the scan
limit - the limit within buffer for the scan
Returns:
the number of ASCII characters found. The stopping position will be at or before the first non-ASCII byte.
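One way to realize this contract is to scan eight bytes at a time and stop at the first long that contains a non-ASCII byte. The sketch below is written under that assumption and is not presented as the actual implementation; the mask value 0x8080808080808080L and the class name AsciiEstimate are also assumptions. Because the mask tests the same bit in every byte, the result is independent of the buffer's byte order.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class AsciiEstimate {
  // Assumed mask: the high bit of each of the eight bytes in a long.
  static final long ASCII_MASK_LONG = 0x8080808080808080L;

  static int estimateConsecutiveAscii(ByteBuffer buffer, int index, int limit) {
    int i = index;
    // Only read a long while at least 8 bytes remain before limit.
    final int lastLongStart = limit - 7;
    while (i < lastLongStart && (buffer.getLong(i) & ASCII_MASK_LONG) == 0) {
      i += 8;
    }
    // An underestimate is fine: the caller continues byte by byte from here.
    return i - index;
  }

  public static void main(String[] args) {
    byte[] data = new byte[20];
    java.util.Arrays.fill(data, (byte) 'a');
    data[10] = (byte) 0xC3; // first non-ASCII byte at position 10
    ByteBuffer big = ByteBuffer.wrap(data).order(ByteOrder.BIG_ENDIAN);
    ByteBuffer little = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
    System.out.println(estimateConsecutiveAscii(big, 0, 20));
    System.out.println(estimateConsecutiveAscii(little, 0, 20));
  }
}
```

The count is approximate in two ways: it is always a multiple of 8, and it stops at the start of the long that contains the first non-ASCII byte rather than at that byte itself.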
-