| Index: installer/src/documentation/localization.dox |
| =================================================================== |
| new file mode 100644 |
| --- /dev/null |
| +++ b/installer/src/documentation/localization.dox |
| @@ -0,0 +1,219 @@ |
| +/*! |
| + |
| +\page localization Localization |
| + |
| +The Windows Installer originally dates from 1999, and it hasn't kept up with the times. |
| +Most problematic is that as of this writing [2013 Nov] it doesn't have official support for Unicode. |
| +Support is largely undocumented, and in any case is only partial. |
| +This causes all sorts of problems with modern tools, many of which default to UTF-8. |
| +WiX, as XML, does not exactly default to UTF-8, but it's close to it by convention. |
| +Some languages are Unicode-only, notably those based on Indic scripts. |
| + |
| +Overall, these limitations generate two kinds of issues. |
| +The first class is deficits in the shipped product. |
| +In short, some strings in certain languages simply cannot be localized. |
| +The second class is a burden on compliance testing before release. |
| +Unicode support is not official in Windows Installer; indeed it's only partial support at that. |
| +But also the WiX toolset does not provide a complete set of assurance tools, |
| + even though all the necessary compiler mechanisms seem to be present. |
| +As a result, manual verification of localization must occur for each language. |
| + |
| +\par MSI format |
| + |
| +The format of Windows Installer affects these issues. |
| +While it's a proprietary format, it's not totally opaque. |
| +Overall, it uses the [COM Structured Storage][structured_storage] format, |
| + which is just an archive file with an internal file system. |
| +In their lingo, a "storage" is a directory and a "stream" is a file. |
| +Two important aspects of localization use mechanisms whose documentation uses these terms. |
| +The "Summary Information Stream" is a file that provides metadata for the installer database. |
| +Embedded language transforms sit in the final MSI as storages. |
| +Some commentary on this topic is available on Rob Mensching's blog; |
| + he's the principal author of WiX and a former member of the Windows Installer team. |
| +- Mensching |
| + [Inside the MSI file format] (http://robmensching.com/blog/posts/2003/11/25/inside-the-msi-file-format) |
| +- Mensching |
| + [Inside the MSI file format, again] (http://robmensching.com/blog/posts/2004/2/10/inside-the-msi-file-format-again) |
| + |
| +[structured_storage]: http://msdn.microsoft.com/en-us/library/aa380369%28VS.85%29.aspx |
| + |
| +The Summary Information Stream is an important structure for two reasons. |
| +The first is that it needs to be localized. |
| +It contains the strings that appear in the Control Panel tool that lists all installed programs. |
| +This includes basic things like the name of the program. |
| +The second is that one of its fields lists the languages supported in a multiple-language installer. |
| +- MSDN |
| + [Summary Property Descriptions] (http://msdn.microsoft.com/en-us/library/windows/desktop/aa372049%28v=vs.85%29.aspx) |
| + |
| +It's worth noting that these summary properties were not originally designed for installation. |
| +Rather, they are derived from the summary information for documents defined by the COM structured storage interface. |
| +The names originally used for these fields were carried over to the SIS for installers, |
| +  even when they didn't fit at all. |
| +One example is use of the page count field to represent the minimum required installer version. |
| +Likewise, certain fields were dropped entirely, such as "total editing time". |
| +- MSDN |
| + [Summary Information Property Set] (http://msdn.microsoft.com/en-us/library/windows/desktop/aa380376%28v=vs.85%29.aspx) |
| + This page is about COM Structured Storage, not Windows Installer. |
| + |
| +The tool OpenMCDF, available on SourceForge, is a .NET component for manipulating structured storage. |
| +It contains a simple structured storage browser that can show if embedded transforms are present. |
| +- SourceForge |
| + [OpenMCDF] (http://sourceforge.net/projects/openmcdf/) |
| + |
| + |
| +\par Installation Strings |
| + |
| +A simple MSI contains two internal files that contain localizable strings. |
| +- The installation database itself. |
| + This is a set of tables, some of which contain properties and UI strings. |
| +- The Summary Information Stream. |
| + This data appears principally in the Add/Remove Programs Menu within the Control Panel. |
| +We're also using embedded directories (storages) to store language-specific transforms. |
| +These transforms, however, are generated from simple MSIs that contain only the two files above. |
| + |
| +These two files are localized differently and have different levels of support for Unicode. |
| +In short, the installer database mostly supports Unicode and the SIS does not. |
| +Unicode is not officially supported anywhere, but it seems to work in practice for installer databases. |
| +The SIS is another story. |
| +Trying to use Unicode in the SIS leads to Windows thinking the MSI is malformed and refusing to install it. |
| +The SIS can be localized into many languages, but not all; see the section on code pages below. |
| + |
| + |
| +\par Language Identifiers |
| + |
| +Microsoft has its own system of identifying languages. |
| +(Could it have been otherwise?) |
| +A language is identified with an LCID, which variously stands for language code identifier or locale identifier. |
| +The inconsistency originates within Microsoft, whose own usage of LCID is not completely consistent. |
| +The earlier LCID is the one accepted by Windows system calls; |
| + the later one is used for the .NET library. |
| +They are mostly the same, but they're not identical. |
| + |
| +Windows Installer does not itself use any of the textual identifiers that are available, |
| +  such as ISO 639 or its IETF usage. |
| +WiX has some support for textual identifiers, but doesn't document how LCIDs are generated from them. |
| +As a result, WiX should be considered unreliable for this. |
| +LCID reference tables are usually specified as hexadecimal constants, |
| + since the LCID itself is a bit field. |
| +Infuriatingly, WiX doesn't support hexadecimal integer constants, |
| + requiring a conversion of the LCID to decimal. |
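Since reference tables give LCIDs in hexadecimal and WiX wants decimal, the conversion is easy to script.
What follows is only a minimal Python sketch.
It assumes the bit layout described on the MSDN "Language Identifiers" page:
primary language in the low 10 bits, sublanguage in the next 6, sort order above bit 16.
The function names are illustrative and are not part of WiX or Windows.

```python
def make_lcid(primary, sublanguage, sort_id=0):
    """Compose an LCID from its bit fields (layout assumed from MSDN)."""
    if not 0 <= primary < (1 << 10):
        raise ValueError("primary language must fit in 10 bits")
    if not 0 <= sublanguage < (1 << 6):
        raise ValueError("sublanguage must fit in 6 bits")
    if not 0 <= sort_id < (1 << 4):
        raise ValueError("sort order must fit in 4 bits")
    langid = (sublanguage << 10) | primary
    return (sort_id << 16) | langid


def split_lcid(lcid):
    """Return (primary language, sublanguage, sort order) for an LCID."""
    return lcid & 0x3FF, (lcid >> 10) & 0x3F, (lcid >> 16) & 0xF


def hex_to_wix_decimal(text):
    """Convert a hexadecimal table entry such as '0x0409' to a decimal string."""
    return str(int(text, 16))


if __name__ == "__main__":
    print(hex_to_wix_decimal("0x0409"))  # US English as a decimal string
    print(split_lcid(1031))              # bit fields of the decimal LCID 1031
```

Keeping the hexadecimal value in a comment next to the decimal one in the WiX source makes later cross-checking against the reference tables easier.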
| + |
| +The notion of a character set is given by a code page, |
| + which is simply an assignment of character strings to glyph strings. |
| +It's not as simple as a character-to-glyph map, |
| + since a double-byte character set (DBCS) uses an internal escape code |
| + and mixes single- and double-byte characters. |
| +Regardless of these details, the important distinction is between the so-called "ANSI code pages" and Unicode. |
| + |
| +- MSDN |
| + [Language Identifiers] (http://msdn.microsoft.com/en-us/library/dd318691%28v=vs.85%29.aspx) |
| + The format of LCID as the operating system recognizes them. |
| +- MSDN |
| +  [Language Identifier Constants and Strings] (http://msdn.microsoft.com/en-us/library/dd318693%28v=vs.85%29.aspx). |
| +  Note: This is for the .NET version, but contains enough information to derive the Windows system call version. |
| + |
| + |
| +\par Code Pages |
| + |
| +The SIS is the only place where code pages really matter, because the SIS doesn't support Unicode. |
| + |
| +Microsoft has its own system of character sets, called code pages. |
| +Code pages predate Unicode, and while Microsoft is shifting over to Unicode, the transition isn't complete. |
| +As long as the Windows Installer doesn't fully support Unicode, code pages will be relevant. |
| +Even afterwards, they'll remain relevant until the first version with working Unicode support is the oldest one still requiring support. |
| + |
| +A code page specifies an encoding from byte strings to character strings. |
| +The details aren't particularly relevant to installation issues. |
| +What matters are the identifiers. |
| +With Unicode, you can settle on a single identifier, "UTF-8", that represents an encoding. |
| +Code page identifiers are 16-bit integers. |
| +The most common one is 1252 (Latin-1), used for English and most Western European languages. |
| +The code page identifier 65001 means UTF-8, but it's not universally supported. |
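Whether a given string can be represented in a given code page can be checked mechanically before it goes anywhere near the SIS.
The Python sketch below is one way to do that.
It assumes that Python's `cpNNNN` codecs are a close enough approximation of the corresponding Windows code pages;
that is an assumption about Python, not a statement about what Windows Installer accepts.
The function name is made up for illustration.

```python
def encodable_in_code_page(text, code_page):
    """Return True if every character of text has an encoding in the code page.

    Uses Python's codec named 'cp<number>' as a stand-in for the Windows
    code page. An unknown code page number raises LookupError.
    """
    codec = "cp{}".format(code_page)
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    samples = [
        ("Installationsprogramm f\u00fcr Windows", 1252),            # German
        ("\u0423\u0441\u0442\u0430\u043d\u043e\u0432\u043a\u0430", 1251),  # Russian
        ("\u0939\u093f\u0928\u094d\u0926\u0940", 1252),              # Hindi
    ]
    for text, code_page in samples:
        print(code_page, encodable_in_code_page(text, code_page))
```

A check like this only catches strings that cannot be encoded at all; it says nothing about whether the target machine has the code page or fonts installed.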
| + |
| +Even within the Microsoft environment, code page identifiers are not completely consistent. |
| +The Windows operating system accepts a basic set of code pages; |
| +  overall, these cover alphabetic writing systems. |
| +There are also OEM code pages and non-native code pages. |
| +Code pages are used by Windows for font selection (amongst other things). |
| +Either or both of the code page itself and the requisite display fonts might be absent on a machine. |
| +Furthermore, .NET introduced some others, such as one for UTF-16LE, |
| + but these are only available to managed applications. |
| +Luckily, there don't seem to be any multiply-assigned code page identifiers. |
| +All this matters because various tables of code page identifiers tend not to distinguish clearly between these categories. |
| +Even armed with a table, a developer or linguist should be cautious about selecting a code page. |
| + |
| +The SIS, however, uses so-called "ANSI code pages", sometimes called "Windows code pages". |
| +This doesn't seem to be a particularly well-defined term. |
| +It seems to mean "acceptable to a system call ending with A and not W", |
| + but even that's not clear. |
| +What is clear is that even though 65001 for UTF-8 appears in some tables of ANSI code pages, |
| + it isn't one insofar as Windows Installer is concerned. |
| +Caveat emptor, indeed. |
| + |
| +- MSDN |
| + [Code Pages] (http://msdn.microsoft.com/en-us/library/windows/desktop/dd317752%28v=vs.85%29.aspx) |
| + Commentary and historical information. |
| +- Wikipedia |
| + [Code page] (http://en.wikipedia.org/wiki/Code_page) |
| + Do not take as authoritative, but nevertheless useful for understanding some things you might see elsewhere. |
| +- MSDN |
| + [Code Page Identifiers] (http://msdn.microsoft.com/en-us/library/dd317756%28VS.85%29.aspx) |
| +- Heath Stewart's blog |
| + [MSI Databases and Code Pages] (http://blogs.msdn.com/b/heaths/archive/2005/10/05/msi-databases-and-code-pages.aspx) |
| + Explains how older tools dealt with this issue; WiX is much easier. |
| + Contains some useful background information. |
| +- Mailing list wix-users |
| + [Build time selection of codepage...] (http://comments.gmane.org/gmane.comp.windows.devel.wix.user/44388) |
| + One developer decided not to localize the SIS at all. |
| +- [Character Sets And Code Pages At The Push Of A Button] (http://www.i18nguy.com/unicode/codepages.html) |
| + A wealth of information about the specific code page and other encodings. |
| + Hopefully we never need to dig this far in. |
| + |
| + |
| +\par Multiple-language Installers |
| + |
| +The Windows Installer does not have good mechanisms for multiple languages. |
| +Each installation package is a single-language installer at the time of installation. |
| +In some situations it's feasible to simply ship out single-language installers, |
| + such as internal deployment within a multinational company. |
| +In others, where multiple languages are required, this leads to bloat because of duplication of installation assets. |
| + |
| +The documented way to support this situation is to deliver two assets: |
| + (1) a basic MSI and (2) a transform (MST) that changes the language of the MSI. |
| +There's syntax on the `msiexec` command line to apply a transform before installation. |
| +For consumer software, this is too much, so people started using an installation driver executable. |
| +This executable would typically present a menu for the locale, |
| + or sometimes infer it from the environment. |
| +After selecting a locale, it would select and apply an appropriate transform. |
| +This is the documented way of doing things. |
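For reference, the documented route boils down to passing the transform through the `TRANSFORMS` property on the `msiexec` command line.
The Python sketch below only builds that command line, in the way a driver executable might; it does not run anything.
The file names are placeholders and the helper name is hypothetical.

```python
def msiexec_install_command(msi_path, transform_path=None):
    """Build an msiexec argument list, optionally applying one transform.

    The result is suitable for subprocess.run(); nothing is executed here.
    """
    command = ["msiexec", "/i", msi_path]
    if transform_path is not None:
        # TRANSFORMS is the public property that names transforms to apply.
        command.append("TRANSFORMS={}".format(transform_path))
    return command


if __name__ == "__main__":
    # Placeholder file names, for illustration only.
    print(" ".join(msiexec_install_command("product.msi", "de.mst")))
```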
| + |
| +There is, however, an undocumented way to automatically apply a transform. |
| +The Summary Information has a property called `Template` that contains a list of languages supported. |
| +These languages are specified by a language identifier (LCID), a decimal integer, |
| + whose sublanguage may either be generic (zero) or specific. |
| +For example, 1033 is US English. |
| +The first (or only) language on the list specifies the language of the non-transformed installation database. |
| +Each subsequent identifier specifies an additional language. |
| +The MSI must contain an embedded transform for each additional language. |
| +(If it doesn't, Windows Installer throws up an error message and aborts installation.) |
| +An embedded transform is an MST file in a substorage (directory) whose name is the decimal LCID. |
| +Windows Installer checks the language list before installation proper begins to determine the most appropriate language. |
| +If it finds one that's not first on the list, it applies its embedded transform before installation proper begins. |
| +Otherwise it uses the installer database as-is, without any transformation. |
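The bookkeeping described above can be sketched in a few lines of Python.
This assumes the `Template` value has the form `platform;lcid,lcid,...` as given on the MSDN page listed below.
It deliberately leaves out how Windows Installer picks the "most appropriate" language, since that is undocumented;
the selected LCID is simply an input.
Both function names are hypothetical.

```python
def parse_template(template):
    """Split a Template value into (platform, [decimal LCIDs])."""
    platform, _, languages = template.partition(";")
    lcids = [int(item) for item in languages.split(",") if item.strip()]
    return platform, lcids


def embedded_transform_name(template, selected_lcid):
    """Return the substorage name of the transform to apply, or None.

    None means the selected language is the first on the list, so the
    installer database is used as-is.
    """
    _, lcids = parse_template(template)
    if not lcids:
        raise ValueError("Template lists no languages")
    if selected_lcid not in lcids:
        raise ValueError("language {} is not listed".format(selected_lcid))
    if selected_lcid == lcids[0]:
        return None
    # An embedded transform lives in a substorage named by the decimal LCID.
    return str(selected_lcid)


if __name__ == "__main__":
    template = "Intel;1033,1031,1036"  # example value, for illustration only
    print(parse_template(template))
    print(embedded_transform_name(template, 1031))
```

A check along these lines is useful in a build script: every LCID after the first must have a matching substorage in the final MSI, or installation aborts.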
| + |
| +- installsite.org |
| + [Multi-Language MSI Packages without Setup.exe Launcher] (http://www.installsite.org/pages/en/msi/articles/embeddedlang/) |
| + The original page that documents the automatic application of embedded language transforms. |
| + First written by Andreas Kerl at Microsoft Germany and then translated into English. |
| +- MSDN |
| + [Template Summary property] (http://msdn.microsoft.com/en-us/library/windows/desktop/aa372070%28v=vs.85%29.aspx) |
| + Documents the format of the Template property in the SIS. |
| + From this page: "Merge Modules are the only packages that may have multiple languages." |
| + Contrary to this quotation, MSI files are also allowed to have multiple languages, although such support is undocumented. |
| +- MSDN |
| + [Embedded Transforms] (http://msdn.microsoft.com/en-us/library/windows/desktop/aa368347%28v=vs.85%29.aspx) |
| + This page not only isn't very informative, it's also slightly wrong. |
| + There's an implication that an embedded transform is stored as a file, which is only sort-of right; |
| + the actual mechanism is as a substorage (directory). |
| + |
| +*/ |