| OLD | NEW | 
|---|
| (Empty) |  | 
|  | 1 /*! | 
|  | 2 | 
|  | 3 \page localization Localization | 
|  | 4 | 
|  | 5 The Windows Installer originally dates from 1999, and it hasn't kept up with the
      times. | 
|  | 6 Most problematic is that as of this writing [2013 Nov] it doesn't have official 
     support for Unicode. | 
|  | 7 Support is largely undocumented, and in any case is only partial. | 
|  | 8 This causes all sorts of problems with modern tools, many of which default to UT
     F-8. | 
|  | 9 WiX, as XML, does not exactly default to UTF-8, but it's close to it by conventi
     on. | 
|  | 10 Some languages are Unicode-only, notably those based on Indic scripts. | 
|  | 11 | 
|  | 12 Overall, these limitations generate two kinds of issues. | 
|  | 13 The first class is deficits in the shipped product. | 
|  | 14 In short, some strings in certain languages simply cannot be localized. | 
|  | 15 The second class is a burden on compliance testing before release. | 
|  | 16 Unicode support is not official in Windows Installer; indeed it's only partial s
     upport at that. | 
|  | 17 But also the WiX toolset does not provide a complete set of assurance tools, | 
|  | 18   even though all the necessary compiler mechanisms seem to be present. | 
|  | 19 As a result, manual verification of localization must occur for each language. | 
|  | 20 | 
|  | 21 \par MSI format | 
|  | 22 | 
|  | 23 The format of Windows Installer affects these issues. | 
|  | 24 While it's a proprietary format, it's not totally opaque. | 
|  | 25 Overall, it uses the [COM Structured Storage][structured_storage] format, | 
|  | 26   which is just an archive file with an internal file system. | 
|  | 27 In their lingo, a "storage" is a directory and a "stream" is a file. | 
|  | 28 Two important aspects of localization use mechanisms whose documentation uses th
     ese terms. | 
|  | 29 The "Summary Information Stream" is a file that provides metadata for the instal
     ler database. | 
|  | 30 Embedded language transforms sit in the final MSI as storages. | 
|  | 31 Some commentary on this topic is available on Rob Mensching's blog; | 
|  | 32   he's the principal author of WiX and a former member of the Windows Installer 
     team. | 
|  | 33 - Mensching | 
|  | 34   [Inside the MSI file format] (http://robmensching.com/blog/posts/2003/11/25/in
     side-the-msi-file-format) | 
|  | 35 - Mensching | 
|  | 36   [Inside the MSI file format, again] (http://robmensching.com/blog/posts/2004/2
     /10/inside-the-msi-file-format-again) | 
|  | 37 | 
|  | 38 [structured_storage]: http://msdn.microsoft.com/en-us/library/aa380369%28VS.85%2
     9.aspx | 
|  | 39 | 
|  | 40 The Summary Information Stream (SIS) is an important structure for two reasons. | 
|  | 41 The first is that it needs to be localized. | 
|  | 42 It contains the strings that appear in the Control Panel tool that lists all ins
     talled programs. | 
|  | 43 This includes basic things like the name of the program. | 
|  | 44 The second is that one of its fields lists the languages supported in a multiple
     -language installer. | 
|  | 45 - MSDN | 
|  | 46   [Summary Property Descriptions] (http://msdn.microsoft.com/en-us/library/windo
     ws/desktop/aa372049%28v=vs.85%29.aspx) | 
|  | 47 | 
|  | 48 It's worth noting that these summary properties were not originally designed for
      installation. | 
|  | 49 Rather, they are derived from the summary information for documents defined by t
     he COM structured storage interface. | 
|  | 50 The names originally used for these fields were carried over to the SIS for inst
     allers, | 
|  | 51     even when they didn't fit at all. | 
|  | 52 One example is the use of the page count field to represent the minimum required installer version. | 
|  | 53 Likewise, certain fields were dropped entirely, such as "total editing time". | 
|  | 54 - MSDN | 
|  | 55   [Summary Information Property Set] (http://msdn.microsoft.com/en-us/library/wi
     ndows/desktop/aa380376%28v=vs.85%29.aspx) | 
|  | 56   This page is about COM Structured Storage, not Windows Installer. | 
|  | 57 | 
|  | 58 The tool OpenMCDF, available on SourceForge, is a .NET component to manipulate structured storage. | 
|  | 59 It contains a simple structured storage browser that can show if embedded transf
     orms are present. | 
|  | 60 - SourceForge | 
|  | 61   [OpenMCDF] (http://sourceforge.net/projects/openmcdf/) | 
|  | 62 | 
|  | 63 | 
|  | 64 \par Installation Strings | 
|  | 65 | 
|  | 66 A simple MSI contains two internal files that contain localizable strings. | 
|  | 67 - The installation database itself. | 
|  | 68   This is a set of tables, some of which contain properties and UI strings. | 
|  | 69 - The Summary Information Stream. | 
|  | 70   This data appears principally in the Add/Remove Programs Menu within the Contr
     ol Panel. | 
|  | 71 We're also using embedded directories (storages) to store language-specific tran
     sforms. | 
|  | 72 These transforms, however, are generated from simple MSIs that contain only the two files above. | 
|  | 73 | 
|  | 74 These two files are localized differently and have different levels of support f
     or Unicode. | 
|  | 75 In short, the installer database mostly supports Unicode and the SIS does not. | 
|  | 76 Unicode is not officially supported anywhere, but it seems to work in practice f
     or installer databases. | 
|  | 77 The SIS is another story. | 
|  | 78 Trying to use Unicode in the SIS leads to Windows thinking the MSI is malformed 
     and refusing to install it. | 
|  | 79 The SIS can be localized into many languages, but not all; see the section on co
     de pages below. | 
|  | 80 | 
|  | 81 | 
|  | 82 \par Language Identifiers | 
|  | 83 | 
|  | 84 Microsoft has its own system of identifying languages. | 
|  | 85 (Could it have been otherwise?) | 
|  | 86 A language is identified with an LCID, which variously stands for language code 
     identifier or locale identifier. | 
|  | 87 The inconsistency originates within Microsoft, which does not itself use the term LCID consistently. | 
|  | 88 The earlier LCID is the one accepted by Windows system calls; | 
|  | 89   the later one is used for the .NET library. | 
|  | 90 They are mostly the same, but they're not identical. | 
|  | 91 | 
|  | 92 Windows Installer does not itself use any of the textual identifiers that are available, | 
|  | 93   such as ISO 639 or its IETF usage. | 
|  | 94 WiX has some support for textual identifiers, but doesn't document how LCIDs are generated from them. | 
|  | 95 As a result, WiX should be considered unreliable for this. | 
|  | 96 LCID reference tables usually specify LCIDs as hexadecimal constants, | 
|  | 97   since the LCID itself is a bit field. | 
|  | 98 Infuriatingly, WiX doesn't support hexadecimal integer constants, | 
|  | 99   requiring a conversion of the LCID to decimal. | 
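
The hexadecimal-to-decimal conversion is mechanical. The following is a minimal Python sketch, not part of WiX or Windows Installer. It assumes the usual language identifier layout (primary language in the low 10 bits, sublanguage in the next 6 bits); the function names are hypothetical and chosen only for illustration.

```python
def lcid_to_decimal(hex_text):
    """Convert a hexadecimal LCID such as '0x0409' to the decimal integer WiX needs."""
    return int(hex_text, 16)


def split_language_id(lcid):
    """Split the low 16 bits of an LCID into (primary language, sublanguage).

    Assumed layout: bits 0-9 are the primary language, bits 10-15 the sublanguage.
    """
    lang_id = lcid & 0xFFFF
    primary = lang_id & 0x3FF
    sublanguage = lang_id >> 10
    return primary, sublanguage


print(lcid_to_decimal("0x0409"))   # prints 1033 (US English)
print(split_language_id(0x0409))   # prints (9, 1)
```

Checking the decomposed fields against a reference table is a cheap way to catch a mistyped decimal value before it reaches the WiX source.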
|  | 100 | 
|  | 101 The notion of a character set is given by a code page, | 
|  | 102   which is an assignment of byte strings to character strings. | 
|  | 103 It's not as simple as a byte-to-character map, | 
|  | 104   since a double-byte character set (DBCS) uses an internal escape code | 
|  | 105   and mixes single- and double-byte characters. | 
|  | 106 Regardless of these details, the important distinction is between the so-called 
     "ANSI code pages" and Unicode. | 
|  | 107 | 
|  | 108 - MSDN | 
|  | 109   [Language Identifiers] (http://msdn.microsoft.com/en-us/library/dd318691%28v=v
     s.85%29.aspx) | 
|  | 110   The format of LCID as the operating system recognizes them. | 
|  | 111 - MSDN | 
|  | 112   [Language Identifier Constants and Strings] (http://msdn.microsoft.com/en-us/l
     ibrary/dd318693%28v=vs.85%29.aspx). | 
|  | 113   Note: This is for the .NET version, but contains enough information to derive the Windows system call version. | 
|  | 114 | 
|  | 115 | 
|  | 116 \par Code Pages | 
|  | 117 | 
|  | 118 The SIS is the only place where code pages really matter, because the SIS doesn'
     t support Unicode. | 
|  | 119 | 
|  | 120 Microsoft has its own system of character sets, called code pages. | 
|  | 121 Code pages predate Unicode, and while Microsoft is shifting over to Unicode, the
      transition isn't complete. | 
|  | 122 As long as the Windows Installer doesn't fully support Unicode, code pages will 
     be relevant. | 
|  | 123 Even afterwards, they'll remain relevant until the first working version is the 
     oldest one requiring support. | 
|  | 124 | 
|  | 125 A code page specifies an encoding from byte strings to character strings. | 
|  | 126 The details aren't particularly relevant to installation issues. | 
|  | 127 What matters are the identifiers. | 
|  | 128 With Unicode, you can settle on a single identifier, "UTF-8", that represents an
      encoding. | 
|  | 129 Code page identifiers are 16-bit integers. | 
|  | 130 The most common one is 1252 (Latin-1), used for English and most Western Europea
     n languages. | 
|  | 131 The code page identifier 65001 means UTF-8, but it's not universally supported. | 
|  | 132 | 
|  | 133 Even within the Microsoft environment, code page identifiers are not completely 
     consistent. | 
|  | 134 The Windows operating system accepts a basic set of code pages; | 
|  | 135   overall these are alphabetic writing systems. | 
|  | 136 There are also OEM code pages and non-native code pages. | 
|  | 137 Code pages are used by Windows for font selection (amongst other things). | 
|  | 138 Either or both of the code page itself and the requisite display fonts might be 
     absent on a machine. | 
|  | 139 Furthermore, .NET introduced some others, such as one for UTF-16LE, | 
|  | 140   but these are only available to managed applications. | 
|  | 141 Luckily, there don't seem to be any multiply-assigned code page identifiers. | 
|  | 142 All this matters because various tables of code page identifiers tend not to dis
     tinguish clearly between these categories. | 
|  | 143 Even armed with a table, the developer or a linguist should be cautious about se
     lecting a code page. | 
|  | 144 | 
|  | 145 The SIS uses, however, so-called "ANSI code pages", sometimes called "Windows co
     de pages". | 
|  | 146 This doesn't seem to be a particularly well-defined term. | 
|  | 147 It seems to mean "acceptable to a system call ending with A and not W", | 
|  | 148   but even that's not clear. | 
|  | 149 What is clear is that even though 65001 for UTF-8 appears in some tables of ANSI
      code pages, | 
|  | 150   it isn't one insofar as Windows Installer is concerned. | 
|  | 151 Caveat emptor, indeed. | 
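
Since the SIS strings must fit into a single ANSI code page, a quick pre-check of candidate strings can save a failed build-and-install cycle. The sketch below uses Python's built-in `cp<number>` codecs as a stand-in for the Windows code page tables. That substitution is an assumption: the Python tables may differ from Windows in edge cases, and passing this check does not prove that Windows Installer accepts the result. The function name is hypothetical.

```python
def fits_code_page(text, code_page):
    """Return True if every character of text is encodable in the given code page.

    Uses Python's 'cp<number>' codec as an approximation of the Windows code page.
    An unknown code page number raises LookupError.
    """
    try:
        text.encode("cp%d" % code_page, errors="strict")
    except UnicodeEncodeError:
        return False
    return True


print(fits_code_page("Example Product", 1252))  # plain ASCII text
print(fits_code_page("Русский", 1251))          # Cyrillic text in the Cyrillic code page
print(fits_code_page("Русский", 1252))          # Cyrillic text in the Western code page
print(fits_code_page("हिन्दी", 1252))             # Devanagari text, a Unicode-only script
```

Manual verification of the installed result is still needed for each language; the check only filters out strings that cannot work at all.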
|  | 152 | 
|  | 153 - MSDN | 
|  | 154   [Code Pages] (http://msdn.microsoft.com/en-us/library/windows/desktop/dd317752
     %28v=vs.85%29.aspx) | 
|  | 155   Commentary and historical information. | 
|  | 156 - Wikipedia | 
|  | 157   [Code page] (http://en.wikipedia.org/wiki/Code_page) | 
|  | 158   Do not take as authoritative, but nevertheless useful for understanding some t
     hings you might see elsewhere. | 
|  | 159 - MSDN | 
|  | 160   [Code Page Identifiers] (http://msdn.microsoft.com/en-us/library/dd317756%28VS
     .85%29.aspx) | 
|  | 161 - Heath Stewart's blog | 
|  | 162   [MSI Databases and Code Pages] (http://blogs.msdn.com/b/heaths/archive/2005/10
     /05/msi-databases-and-code-pages.aspx) | 
|  | 163   Explains how older tools dealt with this issue; WiX is much easier. | 
|  | 164   Contains some useful background information. | 
|  | 165 - Mailing list wix-users | 
|  | 166   [Build time selection of codepage...] (http://comments.gmane.org/gmane.comp.wi
     ndows.devel.wix.user/44388) | 
|  | 167   One developer decided not to localize the SIS at all. | 
|  | 168 - [Character Sets And Code Pages At The Push Of A Button] (http://www.i18nguy.co
     m/unicode/codepages.html) | 
|  | 169   A wealth of information about the specific code page and other encodings. | 
|  | 170   Hopefully we never need to dig this far in. | 
|  | 171 | 
|  | 172 | 
|  | 173 \par Multiple-language Installers | 
|  | 174 | 
|  | 175 The Windows Installer does not have good mechanisms for multiple languages. | 
|  | 176 Each installation package is a single-language installer at the time of installa
     tion. | 
|  | 177 In some situations it's feasible to simply ship out single-language installers, | 
|  | 178   such as internal deployment within a multinational company. | 
|  | 179 In others, where multiple languages are required, this leads to bloat because of
      duplication of installation assets. | 
|  | 180 | 
|  | 181 The documented way to support this situation is to deliver two assets: | 
|  | 182   (1) a basic MSI and (2) a transform (MST) that changes the language of the MSI
     . | 
|  | 183 There's syntax on the `msiexec` command line to apply a transform before install
     ation. | 
|  | 184 For consumer software, this is too much, so people started using an installation
      driver executable. | 
|  | 185 This executable would typically present a menu for the locale, | 
|  | 186   or sometimes infer it from the environment. | 
|  | 187 After selecting a locale, it would select and apply an appropriate transform. | 
|  | 188 This is the documented way of doing things. | 
|  | 189 | 
|  | 190 There is, however, an undocumented way to automatically apply a transform. | 
|  | 191 The Summary Information has a property called `Template` that contains a list of
      languages supported. | 
|  | 192 These languages are specified by a language identifier (LCID), a decimal integer, | 
|  | 193   whose sublanguage may either be generic (zero) or specific. | 
|  | 194 For example, 1033 is US English. | 
|  | 195 The first (or only) language on the list specifies the language of the non-trans
     formed installation database. | 
|  | 196 Each subsequent identifier specifies an additional language. | 
|  | 197 The MSI must contain an embedded transform for each additional language. | 
|  | 198 (If it doesn't, Windows Installer throws up an error message and aborts installation.) | 
|  | 199 An embedded transform is an MST file in a substorage (directory) whose name is t
     he decimal LCID. | 
|  | 200 Windows Installer checks the language list before installation proper begins to 
     determine the most appropriate language. | 
|  | 201 If it finds one that's not first on the list, it applies its embedded transform before installation proper begins. | 
|  | 202 Otherwise it uses the installer database as-is, without any transformation. | 
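
The relationship between the `Template` property and the embedded transforms can be sketched in a few lines of Python. The sketch assumes the `Template` value has the form `platform;lcid,lcid,...` described on the Template Summary property page listed below. The function names are hypothetical, and the code only computes which substorage names must exist; it does not read or write an MSI.

```python
def parse_template(template):
    """Split a Template value such as 'Intel;1033,1031' into (platform, [LCIDs])."""
    platform, _, languages = template.partition(";")
    lcids = [int(item) for item in languages.split(",") if item.strip()]
    return platform, lcids


def required_transform_storages(template):
    """Names of the substorages that must hold an embedded language transform.

    The first LCID is the language of the untransformed database, so only the
    remaining ones need a transform, each stored under its decimal LCID.
    """
    _, lcids = parse_template(template)
    return [str(lcid) for lcid in lcids[1:]]


print(parse_template("Intel;1033,1031,1036"))
print(required_transform_storages("Intel;1033,1031,1036"))
```

A check of this kind could compare the computed names against the substorages actually present in the built MSI, since a missing transform aborts installation.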
|  | 203 | 
|  | 204 - installsite.org | 
|  | 205   [Multi-Language MSI Packages without Setup.exe Launcher] (http://www.installsi
     te.org/pages/en/msi/articles/embeddedlang/) | 
|  | 206   The original page that documents the automatic application of embedded languag
     e transforms. | 
|  | 207   First written by Andreas Kerl at Microsoft Germany and then translated into En
     glish. | 
|  | 208 - MSDN | 
|  | 209   [Template Summary property] (http://msdn.microsoft.com/en-us/library/windows/d
     esktop/aa372070%28v=vs.85%29.aspx) | 
|  | 210   Documents the format of the Template property in the SIS. | 
|  | 211   From this page: "Merge Modules are the only packages that may have multiple la
     nguages." | 
|  | 212   Contrary to this quotation, MSI files are also allowed to have multiple langua
     ges, although such support is undocumented. | 
|  | 213 - MSDN | 
|  | 214   [Embedded Transforms] (http://msdn.microsoft.com/en-us/library/windows/desktop
     /aa368347%28v=vs.85%29.aspx) | 
|  | 215   This page not only isn't very informative, it's also slightly wrong. | 
|  | 216   There's an implication that an embedded transform is stored as a file, which i
     s only sort-of right; | 
|  | 217     the actual mechanism is as a substorage (directory). | 
|  | 218 | 
|  | 219 */ | 