Wednesday, August 31, 2011

C++: Unicode

Character sets
http://www.microsoft.com/typography/unicode/cs.htm



Support for Unicode
http://msdn.microsoft.com/en-us/library/2dax2h36.aspx

Unicode is a specification for supporting all character sets, including character sets that cannot be represented in a single byte. If you are programming for an international market, consider using either Unicode or multibyte character sets (MBCSs), or write your program so that it can be built for either by changing a compiler switch.

A wide character (wchar_t in Visual C++) is a 2-byte multilingual character code. Most characters used in modern computing worldwide, including technical symbols and special publishing characters, can be represented according to the Unicode specification as a single wide character. Characters that do not fit in one wide character are represented as a pair of wide characters using Unicode's surrogate feature. Because each wide character has a fixed size of 16 bits, wide characters simplify programming with international character sets, although code that must handle surrogate pairs cannot assume one wide character per character.
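As a rough illustration of the surrogate feature described above, the following sketch splits a code point above U+FFFF into two 16-bit units. The function name to_utf16 is made up for this note and is not part of any Microsoft API. It assumes the input is a valid code point and does no error checking.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode code point as 16-bit units.
// Assumes cp is valid (at most 0x10FFFF and not itself a surrogate).
std::u16string to_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        // Fits in a single 16-bit unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Subtract 0x10000, then split the remaining 20 bits in half.
        const std::uint32_t v = static_cast<std::uint32_t>(cp) - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
    return out;
}

// Small self-check: 'A' needs one unit, U+1F600 needs a surrogate pair.
void check_to_utf16() {
    assert(to_utf16(0x41).size() == 1);
    assert(to_utf16(0x1F600).size() == 2);
}
```

char16_t is used here instead of wchar_t so that the unit size is 16 bits on every compiler, not only on Visual C++.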

Generally, wide characters take more memory than multibyte characters but are faster to process. In addition, a multibyte encoding can represent only one locale at a time, whereas Unicode represents all of the world's character sets simultaneously.
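A minimal sketch of the space trade-off, assuming plain ASCII text and using char16_t as a portable stand-in for a 16-bit wide character. The helper name storage_bytes is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: bytes used by the characters of a string,
// not counting the terminator or any allocator overhead.
template <typename CharT>
std::size_t storage_bytes(const std::basic_string<CharT>& s) {
    return s.size() * sizeof(CharT);
}

// The same three letters stored narrow and wide.
void check_storage() {
    const std::string    narrow = "abc";
    const std::u16string wide   = u"abc";
    assert(narrow.size() == wide.size());  // same number of characters
    assert(storage_bytes(wide) == storage_bytes(narrow) * sizeof(char16_t));
}
```

For text in a double-byte character set the gap is smaller, since many of those characters already occupy two bytes in the multibyte form.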



Unicode Programming Summary
http://msdn.microsoft.com/en-us/library/dybsewaf%28VS.80%29.aspx

With _UNICODE defined, the _T macro (declared in <tchar.h>) expands a string literal to its L-prefixed wide form, so _T("text") becomes L"text"; otherwise, _T leaves the literal as an ordinary narrow string.
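The switch can be sketched without <tchar.h>, which exists only on Windows toolchains. The names MY_UNICODE, MY_T, my_tchar, and my_tcslen below are hypothetical stand-ins for _UNICODE, _T, TCHAR, and _tcslen. They mimic the idea and are not the real header.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for the <tchar.h> mappings.
#ifdef MY_UNICODE
  typedef wchar_t my_tchar;
  #define MY_T(s) L##s   // paste the L prefix onto the literal
#else
  typedef char my_tchar;
  #define MY_T(s) s      // leave the literal unchanged
#endif

// The same source line yields a wide or a narrow string, depending on the switch.
const my_tchar* greeting() { return MY_T("hello"); }

// Character count that works for either character type.
std::size_t my_tcslen(const my_tchar* s) {
    std::size_t n = 0;
    while (s[n] != 0) ++n;
    return n;
}

void check_greeting() {
    assert(my_tcslen(greeting()) == 5);
}
```

Compiling with -DMY_UNICODE (or /DMY_UNICODE in Visual C++) switches every MY_T literal to the wide form without touching the source.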



Multibyte Character Sets
http://msdn.microsoft.com/en-us/library/5z097dxa.aspx



Windows ANSI character set

C++: Literals

http://msdn.microsoft.com/en-us/library/c70dax92%28v=vs.80%29.aspx

Invariant program elements are called "literals" or "constants." The terms "literal" and "constant" are used interchangeably here. Literals fall into four major categories: integer, character, floating-point, and string literals.
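One example of each of the four categories, as a small sketch. The constant names are made up for illustration, and the two wide forms are added only to tie back to the L prefix discussed in the Unicode notes.

```cpp
#include <cassert>
#include <string>

const int         kInteger  = 42;      // integer literal
const char        kChar     = 'A';     // character literal
const double      kFloating = 3.5;     // floating-point literal
const char* const kString   = "text";  // string literal

// Wide forms of the character and string literals (L prefix).
const wchar_t        kWideChar   = L'A';
const wchar_t* const kWideString = L"text";

void check_literals() {
    assert(kInteger == 42);
    assert(std::string(kString).size() == std::wstring(kWideString).size());
}
```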