| Index: installer/src/documentation/localization.dox | 
| =================================================================== | 
| new file mode 100644 | 
| --- /dev/null | 
| +++ b/installer/src/documentation/localization.dox | 
| @@ -0,0 +1,219 @@ | 
| +/*! | 
| + | 
| +\page localization Localization | 
| + | 
| +The Windows Installer originally dates from 1999, and it hasn't kept up with the times. | 
| +Most problematic is that as of this writing [2013 Nov] it doesn't have official support for Unicode. | 
| +Support is largely undocumented, and in any case is only partial. | 
| +This causes all sorts of problems with modern tools, many of which default to UTF-8. | 
| +WiX, as XML, does not exactly default to UTF-8, but it's close to it by convention. | 
| +Some languages are Unicode-only, notably those based on Indic scripts. | 
| + | 
| +Overall, these limitation generate two kinds of issues. | 
| +The first class is deficits in the shipped product. | 
| +In short, some strings in certain languages simply cannot be localized. | 
| +The second class is a burden on compliance testing before release. | 
| +Unicode support is not official in Windows Installer; indeed it's only partial support at that. | 
| +But also the WiX toolset does not provide a complete set of assurance tools, | 
| +  even though all the necessary compiler mechanisms seem to be present. | 
| +As a result, manual verification of localization must occur for each language. | 
| + | 
| +\par MSI format | 
| + | 
| +The format of Windows Installer affects these issues. | 
| +While it's a proprietary format, it's not totally opaque. | 
| +Overall, it's uses the [COM Structured Storage][structured_storage] format, | 
| +  which is just an archive file with an internal file system. | 
| +In their lingo, a "storage" is a directory and "stream" is a file. | 
| +Two important aspects of localization use mechanisms whose documentation uses these terms. | 
| +The "Summary Information Stream" is a file that provides metadata for the installer database. | 
| +Embedded language transforms sit in the final MSI as storages. | 
| +Some commentary on this topic is available on Rob Mensching's blog; | 
| +  he's the principal author of WiX and a former member of the Windows Installer team. | 
| +- Mensching | 
| +  [Inside the MSI file format] (http://robmensching.com/blog/posts/2003/11/25/inside-the-msi-file-format) | 
| +- Mensching | 
| +  [Inside the MSI file format, again] (http://robmensching.com/blog/posts/2004/2/10/inside-the-msi-file-format-again) | 
| + | 
| +[structured_storage]: http://msdn.microsoft.com/en-us/library/aa380369%28VS.85%29.aspx | 
| + | 
| +The Summary Information Stream is an important structure for two reasons. | 
| +The first is that it needs to be localized. | 
| +It contains the strings that appear in the Control Panel tool that lists all installed programs. | 
| +This includes basic things like the name of the program. | 
| +The second is that one of its fields lists the languages supported in a multiple-language installer. | 
| +- MSDN | 
| +  [Summary Property Descriptions] (http://msdn.microsoft.com/en-us/library/windows/desktop/aa372049%28v=vs.85%29.aspx) | 
| + | 
| +It's worth noting that these summary properties were not originally designed for installation. | 
| +Rather, they are derived from the summary information for documents defined by the COM structured storage interface. | 
| +The names originally used for these fields were carried over to the SIS for installers, | 
| +    even when they didn't fit at all, | 
| +One example is use of the page count field to represent the minimum required installer version. | 
| +Likewise, certain fields were dropped entirely, such as "total editing time". | 
| +- MSDN | 
| +  [Summary Information Property Set] (http://msdn.microsoft.com/en-us/library/windows/desktop/aa380376%28v=vs.85%29.aspx) | 
| +  This page is about COM Structured Storage, not Windows Installer. | 
| + | 
| +The tool OpenMCDF, available on sourceforge, is a .NET component to manipulate structured storage. | 
| +It contains a simple structured storage browser that can show if embedded transforms are present. | 
| +- SourceForge | 
| +  [OpenMCDF] (http://sourceforge.net/projects/openmcdf/) | 
| + | 
| + | 
| +\par Installation Strings | 
| + | 
| +A simple MSI contains two internal files that contain localizable strings. | 
| +- An installation database itself. | 
| +  This is a set of tables, some of which contain properties and UI strings. | 
| +- The Summary Information Stream. | 
| +  This data appears principally in the Add/Remove Programs Menu within the Control Panel. | 
| +We're also using embedded directories (storages) to store language-specific transforms. | 
| +These transforms, however, are generated from simple MSI's that contain only the two files above. | 
| + | 
| +These two files are localized differently and have different levels of support for Unicode. | 
| +In short, the installer database mostly supports Unicode and the SIS does not. | 
| +Unicode is not officially supported anywhere, but it seems to work in practice for installer databases. | 
| +The SIS is another story. | 
| +Trying to use Unicode in the SIS leads to Windows thinking the MSI is malformed and refusing to install it. | 
| +The SIS can be localized into many languages, but not all; see the section on code pages below. | 
| + | 
| + | 
| +\par Language Identifiers | 
| + | 
| +Microsoft has its own system of identifying languages. | 
| +(Could it have been otherwise?) | 
| +A language is identified with an LCID, which variously stands for language code identifier or locale identifier. | 
| +The inconsistency is internal to Microsoft, because it itself doesn't have a completely consistent usage of LCID. | 
| +The earlier LCID is the one accepted by Windows system calls; | 
| +  the later one is used for the .NET library. | 
| +They are mostly the same, but they're not identical. | 
| + | 
| +Windows installer does not itself use any of the textual identifiers that are available, | 
| +  such as ISO 693 or its IETF usage. | 
| +WiX has some support for textual identifiers, but doesn't document how LCID's are generated from them. | 
| +As a result, WiX should be considered unreliable for this. | 
| +LCID reference tables are usually specified as hexadecimal constant, | 
| +  since the LCID itself is a bit field. | 
| +Infuriatingly, WiX doesn't support hexadecimal integer constants, | 
| +  requiring a conversion of the LCID to decimal. | 
| + | 
| +The notion of a character set is given by a code page, | 
| +  which is simply an assignment of character strings to glyph strings. | 
| +It's not as simple as a character-to-glyph map, | 
| +  since a double-byte character set (DBCS) uses an internal escape code | 
| +  and mixes single- and double-byte characters. | 
| +Regardless of these details, the important distinction is between the so-called "ANSI code pages" and Unicode. | 
| + | 
| +- MSDN | 
| +  [Language Identifiers] (http://msdn.microsoft.com/en-us/library/dd318691%28v=vs.85%29.aspx) | 
| +  The format of LCID as the operating system recognizes them. | 
| +- MSDN | 
| +  [Language Identifier Constants and Strings] (http://msdn.microsoft.com/en-us/library/dd318693%28v=vs.85%29.aspx). | 
| +  Note: This is for the .NET version, but contains enough information to derive the Window system call version. | 
| + | 
| + | 
| +\par Code Pages | 
| + | 
| +The SIS is the only place where code pages really matter, because the SIS doesn't support Unicode. | 
| + | 
| +Microsoft has its own system of character sets, called code pages. | 
| +Code pages predate Unicode, and while Microsoft is shifting over to Unicode, the transition isn't complete. | 
| +As long as the Windows Installer doesn't fully support Unicode, code pages will be relevant. | 
| +Even afterwards, they'll remain relevant until the first working version is the oldest one requiring support. | 
| + | 
| +A code page specifies an encoding from byte strings to character strings. | 
| +The details aren't particularly relevant to installation issues. | 
| +What matters are the identifiers. | 
| +With Unicode, you can settle on a single identifier, "UTF-8", that represents an encoding. | 
| +Code page identifiers are 16-bit integers. | 
| +The most common one is 1252 (Latin-1), used for English and most Western European languages. | 
| +The code page identifier 65001 means UTF-8, but it's not universally supported. | 
| + | 
| +Even within the Microsoft environment, code page identifiers are not completely consistent. | 
| +The Windows operating system accepts a basic set of code pages; | 
| +  overall these are alphabetic writing systems. | 
| +There are also OEM code pages and non-native code pages. | 
| +Code pages are used by Windows for font selection (amongst other things). | 
| +Either or both of the code page itself and the requisite display fonts might be absent on a machine. | 
| +Furthermore, .NET introduced some others, such one for UTF-16LE, | 
| +  but these are only available to managed applications. | 
| +Luckily, there don't seem to be any multiply-assigned code page identifiers. | 
| +All this matters because various tables of code page identifiers tend not to distinguish clearly between these categories. | 
| +Even armed with a table, the developer or a linguist should be cautious about selecting a code page. | 
| + | 
| +The SIS uses, however, so-called "ANSI code pages", sometimes called "Windows code pages". | 
| +This doesn't seem to be particularly well-defined term. | 
| +It seems to mean "acceptable to a system call ending with A and not W", | 
| +  but even that's not clear. | 
| +What is clear is that even though 65001 for UTF-8 appears in some tables of ANSI code pages, | 
| +  it isn't one insofar as Windows Installer is concerned. | 
| +Caveat emptor, indeed. | 
| + | 
| +- MSDN | 
| +  [Code Pages] (http://msdn.microsoft.com/en-us/library/windows/desktop/dd317752%28v=vs.85%29.aspx) | 
| +  Commentary and historical information. | 
| +- Wikipedia | 
| +  [Code page] (http://en.wikipedia.org/wiki/Code_page) | 
| +  Do not take as authoritative, but nevertheless useful for understanding some things you might see elsewhere. | 
| +- MSDN | 
| +  [Code Page Identifiers] (http://msdn.microsoft.com/en-us/library/dd317756%28VS.85%29.aspx) | 
| +- Heath Stewart's blog | 
| +  [MSI Databases and Code Pages] (http://blogs.msdn.com/b/heaths/archive/2005/10/05/msi-databases-and-code-pages.aspx) | 
| +  Explains how older tools dealt with this issue; WiX is much easier. | 
| +  Contains some useful background information. | 
| +- Mailing list wix-users | 
| +  [Build time selection of codepage...] (http://comments.gmane.org/gmane.comp.windows.devel.wix.user/44388) | 
| +  One developer decided not to localize the SIS at all. | 
| +- [Character Sets And Code Pages At The Push Of A Button] (http://www.i18nguy.com/unicode/codepages.html) | 
| +  A wealth of information about the specific code page and other encodings. | 
| +  Hopefully we never need to dig this far in. | 
| + | 
| + | 
| +\par Multiple-language Installers | 
| + | 
| +The Windows Installer does not have good mechanisms for multiple languages. | 
| +Each installation package is a single-language installer at the time of installation. | 
| +In some situations it's feasible to simply ship out single-language installers, | 
| +  such as internal deployment within a multinational company. | 
| +In others, where multiple languages are required, this leads to bloat because of duplication of installation assets. | 
| + | 
| +The documented way to support this situation is to deliver two assets: | 
| +  (1) a basic MSI and (2) a transform (MST) that changes the language of the MSI. | 
| +There's syntax on the `msiexec` command line to apply a transform before installation. | 
| +For consumer software, this is too much, so people started using an installation driver executable. | 
| +This executable would typically present a menu for the locale, | 
| +  or sometimes infer it from the environment. | 
| +After selecting a locale, it would select and apply an appropriate transform. | 
| +This is the documented way of doing things. | 
| + | 
| +There is, however, an undocumented way to automatically apply a transform. | 
| +The Summary Information has a property called `Template` that contains a list of languages supported. | 
| +These languages are specified by a language identifer (LCID), a decimal integer, | 
| +  whose sublanguage may either be generic (zero) or specific. | 
| +For example, 1033 is US English. | 
| +The first (or only) language on the list specifies the language of the non-transformed installation database. | 
| +Each subsequent identifier specifies an additional language. | 
| +The MSI must contain an embedded transform for each additional language. | 
| +(If it doens't, Windows Installer throws up an error message and aborts installation.) | 
| +An embedded transform is an MST file in a substorage (directory) whose name is the decimal LCID. | 
| +Windows Installer checks the language list before installation proper begins to determine the most appropriate language. | 
| +If finds one that's not first on the list, it applies its embedded transform before installation proper begins. | 
| +Otherwise it uses the installer database as-is, without any transformation. | 
| + | 
| +- installsite.org | 
| +  [Multi-Language MSI Packages without Setup.exe Launcher] (http://www.installsite.org/pages/en/msi/articles/embeddedlang/) | 
| +  The original page that documents the automatic application of embedded language transforms. | 
| +  First written by Andreas Kerl at Microsoft Germany and then translated into English. | 
| +- MSDN | 
| +  [Template Summary property] (http://msdn.microsoft.com/en-us/library/windows/desktop/aa372070%28v=vs.85%29.aspx) | 
| +  Documents the format of the Template property in the SIS. | 
| +  From this page: "Merge Modules are the only packages that may have multiple languages." | 
| +  Contrary to this quotation, MSI files are also allowed to have multiple languages, although such support is undocumented. | 
| +- MSDN | 
| +  [Embedded Transforms] (http://msdn.microsoft.com/en-us/library/windows/desktop/aa368347%28v=vs.85%29.aspx) | 
| +  This page not only isn't very informative, it's also slightly wrong. | 
| +  There's an implication that an embedded transform is stored as a file, which is only sort-of right; | 
| +    the actual mechanism is as a substorage (directory). | 
| + | 
| +*/ | 
|  |