Rietveld Code Review Tool
Side by Side Diff: installer/src/documentation/localization.dox

Issue 5394579117309952: Installer localization and new build system (Closed)
Patch Set: Created Dec. 10, 2013, 4:11 a.m.
1 /*!
2
3 \page localization Localization
4
5 The Windows Installer originally dates from 1999, and it hasn't kept up with the times.
6 Most problematic is that as of this writing [2013 Nov] it doesn't have official support for Unicode.
7 Support is largely undocumented, and in any case is only partial.
8 This causes all sorts of problems with modern tools, many of which default to UTF-8.
9 WiX, as XML, does not exactly default to UTF-8, but it's close to it by convention.
10 Some languages are Unicode-only, notably those based on Indic scripts.
11
12 Overall, these limitations generate two kinds of issues.
13 The first class is deficits in the shipped product.
14 In short, some strings in certain languages simply cannot be localized.
15 The second class is a burden on compliance testing before release.
16 Unicode support is not official in Windows Installer; indeed it's only partial support at that.
17 But also the WiX toolset does not provide a complete set of assurance tools,
18 even though all the necessary compiler mechanisms seem to be present.
19 As a result, manual verification of localization must occur for each language.
20
21 \par MSI format
22
23 The format of Windows Installer affects these issues.
24 While it's a proprietary format, it's not totally opaque.
25 Overall, it uses the [COM Structured Storage][structured_storage] format,
26 which is just an archive file with an internal file system.
27 In their lingo, a "storage" is a directory and a "stream" is a file.
28 Two important aspects of localization use mechanisms whose documentation uses these terms.
29 The "Summary Information Stream" is a file that provides metadata for the installer database.
30 Embedded language transforms sit in the final MSI as storages.
31 Some commentary on this topic is available on Rob Mensching's blog;
32 he's the principal author of WiX and a former member of the Windows Installer team.
33 - Mensching
34 [Inside the MSI file format] (http://robmensching.com/blog/posts/2003/11/25/inside-the-msi-file-format)
35 - Mensching
36 [Inside the MSI file format, again] (http://robmensching.com/blog/posts/2004/2/10/inside-the-msi-file-format-again)
37
38 [structured_storage]: http://msdn.microsoft.com/en-us/library/aa380369%28VS.85%29.aspx
39
40 The Summary Information Stream is an important structure for two reasons.
41 The first is that it needs to be localized.
42 It contains the strings that appear in the Control Panel tool that lists all installed programs.
43 This includes basic things like the name of the program.
44 The second is that one of its fields lists the languages supported in a multiple-language installer.
45 - MSDN
46 [Summary Property Descriptions] (http://msdn.microsoft.com/en-us/library/windows/desktop/aa372049%28v=vs.85%29.aspx)
47
48 It's worth noting that these summary properties were not originally designed for installation.
49 Rather, they are derived from the summary information for documents defined by the COM structured storage interface.
50 The names originally used for these fields were carried over to the SIS for installers,
51 even when they didn't fit at all.
52 One example is the use of the page count field to represent the minimum required installer version.
53 Likewise, certain fields were dropped entirely, such as "total editing time".
54 - MSDN
55 [Summary Information Property Set] (http://msdn.microsoft.com/en-us/library/windows/desktop/aa380376%28v=vs.85%29.aspx)
56 This page is about COM Structured Storage, not Windows Installer.
57
58 The tool OpenMCDF, available on SourceForge, is a .NET component to manipulate structured storage.
59 It contains a simple structured storage browser that can show if embedded transforms are present.
60 - SourceForge
61 [OpenMCDF] (http://sourceforge.net/projects/openmcdf/)
62
63
64 \par Installation Strings
65
66 A simple MSI contains two internal files that contain localizable strings.
67 - An installation database itself.
68 This is a set of tables, some of which contain properties and UI strings.
69 - The Summary Information Stream.
70 This data appears principally in the Add/Remove Programs Menu within the Control Panel.
71 We're also using embedded directories (storages) to store language-specific transforms.
72 These transforms, however, are generated from simple MSIs that contain only the two files above.
73
74 These two files are localized differently and have different levels of support for Unicode.
75 In short, the installer database mostly supports Unicode and the SIS does not.
76 Unicode is not officially supported anywhere, but it seems to work in practice for installer databases.
77 The SIS is another story.
78 Trying to use Unicode in the SIS leads to Windows thinking the MSI is malformed and refusing to install it.
79 The SIS can be localized into many languages, but not all; see the section on code pages below.
80
81
82 \par Language Identifiers
83
84 Microsoft has its own system of identifying languages.
85 (Could it have been otherwise?)
86 A language is identified with an LCID, which variously stands for language code identifier or locale identifier.
87 The inconsistency is internal to Microsoft, because it itself doesn't have a completely consistent usage of LCID.
88 The earlier LCID is the one accepted by Windows system calls;
89 the later one is used for the .NET library.
90 They are mostly the same, but they're not identical.
91
92 Windows Installer does not itself use any of the textual identifiers that are available,
93 such as ISO 639 or its IETF usage.
94 WiX has some support for textual identifiers, but doesn't document how LCIDs are generated from them.
95 As a result, WiX should be considered unreliable for this.
96 LCID reference tables are usually specified as hexadecimal constants,
97 since the LCID itself is a bit field.
98 Infuriatingly, WiX doesn't support hexadecimal integer constants,
99 requiring a conversion of the LCID to decimal.
100
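Because reference tables give LCIDs in hexadecimal while WiX wants decimal, a small helper can do the conversion. The following is a minimal Python sketch, not part of the build system. The function names are hypothetical, and the bit layout assumed here (10-bit primary language, 6-bit sublanguage, sort identifier above bit 16) is the one described on the MSDN "Language Identifiers" page.

```python
def make_langid(primary: int, sublang: int) -> int:
    """Pack a primary language (10 bits) and sublanguage (6 bits) into a LANGID."""
    if not 0 <= primary < (1 << 10):
        raise ValueError("primary language must fit in 10 bits")
    if not 0 <= sublang < (1 << 6):
        raise ValueError("sublanguage must fit in 6 bits")
    return (sublang << 10) | primary


def make_lcid(langid: int, sort_id: int = 0) -> int:
    """Combine a LANGID with a sort identifier (0 means the default sort)."""
    return (sort_id << 16) | langid


def split_lcid(lcid: int):
    """Return (primary language, sublanguage, sort identifier) for an LCID."""
    langid = lcid & 0xFFFF
    return langid & 0x3FF, langid >> 10, (lcid >> 16) & 0xF


def lcid_hex_to_decimal(text: str) -> int:
    """Convert a hexadecimal LCID such as '0x0409' to the decimal form WiX expects."""
    return int(text, 16)


if __name__ == "__main__":
    # English (primary 0x09) with the US sublanguage (0x01).
    print(make_lcid(make_langid(0x09, 0x01)))
    print(lcid_hex_to_decimal("0x0407"))
```

A sublanguage of zero yields the generic form of the language, so `make_langid(0x09, 0)` is simply 9.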
101 The notion of a character set is given by a code page,
102 which is simply an assignment of character strings to glyph strings.
103 It's not as simple as a character-to-glyph map,
104 since a double-byte character set (DBCS) uses an internal escape code
105 and mixes single- and double-byte characters.
106 Regardless of these details, the important distinction is between the so-called "ANSI code pages" and Unicode.
107
108 - MSDN
109 [Language Identifiers] (http://msdn.microsoft.com/en-us/library/dd318691%28v=vs.85%29.aspx)
110 The format of LCID as the operating system recognizes them.
111 - MSDN
112 [Language Identifier Constants and Strings] (http://msdn.microsoft.com/en-us/library/dd318693%28v=vs.85%29.aspx).
113 Note: This is for the .NET version, but contains enough information to derive the Windows system call version.
114
115
116 \par Code Pages
117
118 The SIS is the only place where code pages really matter, because the SIS doesn't support Unicode.
119
120 Microsoft has its own system of character sets, called code pages.
121 Code pages predate Unicode, and while Microsoft is shifting over to Unicode, the transition isn't complete.
122 As long as the Windows Installer doesn't fully support Unicode, code pages will be relevant.
123 Even afterwards, they'll remain relevant until the first working version is the oldest one requiring support.
124
125 A code page specifies an encoding from byte strings to character strings.
126 The details aren't particularly relevant to installation issues.
127 What matters are the identifiers.
128 With Unicode, you can settle on a single identifier, "UTF-8", that represents an encoding.
129 Code page identifiers are 16-bit integers.
130 The most common one is 1252 (Latin-1), used for English and most Western European languages.
131 The code page identifier 65001 means UTF-8, but it's not universally supported.
132
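Whether a given string can be represented in a candidate code page can be checked mechanically before it goes anywhere near the SIS. The sketch below uses Python's built-in `cpNNNN` codecs as a stand-in for the Windows code page tables. That is an assumption: the codecs approximate the Windows tables and say nothing about what Windows Installer itself accepts. The function name is hypothetical.

```python
def fits_code_page(text: str, code_page: int) -> bool:
    """Return True if every character of text can be encoded in the code page.

    Uses Python's codec named 'cp<identifier>'. A LookupError propagates
    if Python has no codec for that identifier.
    """
    codec = f"cp{code_page}"
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    for sample, page in [("Français", 1252), ("Русский", 1251), ("हिन्दी", 1252)]:
        print(sample, page, fits_code_page(sample, page))
```

A string in an Indic script fails for 1252, which is the situation where SIS strings cannot be localized.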
133 Even within the Microsoft environment, code page identifiers are not completely consistent.
134 The Windows operating system accepts a basic set of code pages;
135 overall these are alphabetic writing systems.
136 There are also OEM code pages and non-native code pages.
137 Code pages are used by Windows for font selection (amongst other things).
138 Either or both of the code page itself and the requisite display fonts might be absent on a machine.
139 Furthermore, .NET introduced some others, such as one for UTF-16LE,
140 but these are only available to managed applications.
141 Luckily, there don't seem to be any multiply-assigned code page identifiers.
142 All this matters because various tables of code page identifiers tend not to distinguish clearly between these categories.
143 Even armed with a table, the developer or a linguist should be cautious about selecting a code page.
144
145 The SIS uses, however, so-called "ANSI code pages", sometimes called "Windows code pages".
146 This doesn't seem to be a particularly well-defined term.
147 It seems to mean "acceptable to a system call ending with A and not W",
148 but even that's not clear.
149 What is clear is that even though 65001 for UTF-8 appears in some tables of ANSI code pages,
150 it isn't one insofar as Windows Installer is concerned.
151 Caveat emptor, indeed.
152
153 - MSDN
154 [Code Pages] (http://msdn.microsoft.com/en-us/library/windows/desktop/dd317752%28v=vs.85%29.aspx)
155 Commentary and historical information.
156 - Wikipedia
157 [Code page] (http://en.wikipedia.org/wiki/Code_page)
158 Do not take as authoritative, but nevertheless useful for understanding some things you might see elsewhere.
159 - MSDN
160 [Code Page Identifiers] (http://msdn.microsoft.com/en-us/library/dd317756%28VS.85%29.aspx)
161 - Heath Stewart's blog
162 [MSI Databases and Code Pages] (http://blogs.msdn.com/b/heaths/archive/2005/10/05/msi-databases-and-code-pages.aspx)
163 Explains how older tools dealt with this issue; WiX is much easier.
164 Contains some useful background information.
165 - Mailing list wix-users
166 [Build time selection of codepage...] (http://comments.gmane.org/gmane.comp.windows.devel.wix.user/44388)
167 One developer decided not to localize the SIS at all.
168 - [Character Sets And Code Pages At The Push Of A Button] (http://www.i18nguy.com/unicode/codepages.html)
169 A wealth of information about the specific code page and other encodings.
170 Hopefully we never need to dig this far in.
171
172
173 \par Multiple-language Installers
174
175 The Windows Installer does not have good mechanisms for multiple languages.
176 Each installation package is a single-language installer at the time of installation.
177 In some situations it's feasible to simply ship out single-language installers,
178 such as internal deployment within a multinational company.
179 In others, where multiple languages are required, this leads to bloat because of duplication of installation assets.
180
181 The documented way to support this situation is to deliver two assets:
182 (1) a basic MSI and (2) a transform (MST) that changes the language of the MSI .
183 There's syntax on the `msiexec` command line to apply a transform before installation.
184 For consumer software, this is too much, so people started using an installation driver executable.
185 This executable would typically present a menu for the locale,
186 or sometimes infer it from the environment.
187 After selecting a locale, it would select and apply an appropriate transform.
188 This is the documented way of doing things.
189
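What an installation driver does after the locale is chosen can be sketched in a few lines. This Python sketch only builds the argument list and does not run anything. It assumes the transform is passed through the `TRANSFORMS` property on the `msiexec` command line. The function name and the file names `product.msi` and `de.mst` are hypothetical placeholders.

```python
from typing import List, Optional


def msiexec_install_args(msi_path: str, transform: Optional[str] = None) -> List[str]:
    """Build an msiexec command line that installs msi_path.

    If transform is given, it is applied before installation via the
    TRANSFORMS property; otherwise the package is installed as-is.
    """
    args = ["msiexec", "/i", msi_path]
    if transform:
        args.append(f"TRANSFORMS={transform}")
    return args


if __name__ == "__main__":
    # Hypothetical file names; a real driver would pick the MST from the chosen locale.
    print(msiexec_install_args("product.msi", "de.mst"))
```

A driver would hand this list to something like `subprocess.run` once the user has picked a locale.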
190 There is, however, an undocumented way to automatically apply a transform.
191 The Summary Information has a property called `Template` that contains a list of languages supported.
192 These languages are specified by a language identifier (LCID), a decimal integer,
193 whose sublanguage may either be generic (zero) or specific.
194 For example, 1033 is US English.
195 The first (or only) language on the list specifies the language of the non-transformed installation database.
196 Each subsequent identifier specifies an additional language.
197 The MSI must contain an embedded transform for each additional language.
198 (If it doesn't, Windows Installer throws up an error message and aborts installation.)
199 An embedded transform is an MST file in a substorage (directory) whose name is the decimal LCID.
200 Windows Installer checks the language list before installation proper begins to determine the most appropriate language.
201 If it finds one that's not first on the list, it applies its embedded transform before installation proper begins.
202 Otherwise it uses the installer database as-is, without any transformation.
202 Otherwise it uses the installer database as-is, without any transformation.
203
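The selection logic described above can be modeled roughly as follows. This is a Python sketch under stated assumptions, not a description of what Windows Installer actually does, since the real matching rules are undocumented. It assumes the `platform;language,language,...` layout given on the MSDN Template Summary page, and it assumes an exact LCID match, whereas the real installer may also fall back to related languages. All names are hypothetical.

```python
from typing import List, NamedTuple, Optional


class TemplateInfo(NamedTuple):
    platform: str
    languages: List[int]


def parse_template(value: str) -> TemplateInfo:
    """Split a Template value such as 'Intel;1033,1031' into its parts."""
    platform, _, langs = value.partition(";")
    languages = [int(item) for item in langs.split(",") if item.strip()]
    return TemplateInfo(platform.strip(), languages)


def transform_storage_for(template: str, user_lcid: int) -> Optional[str]:
    """Return the substorage name of the transform to apply, or None.

    None means the installer database is used as-is: either the user's
    language is the first (base) language or it is not listed at all.
    Exact matching is an assumption of this sketch.
    """
    info = parse_template(template)
    if user_lcid in info.languages[1:]:
        # The embedded transform lives in a substorage named by the decimal LCID.
        return str(user_lcid)
    return None


if __name__ == "__main__":
    template = "Intel;1033,1031,1036"
    for lcid in (1033, 1031, 1041):
        print(lcid, transform_storage_for(template, lcid))
```

The list of substorage names that must exist in the MSI is simply `parse_template(value).languages[1:]` rendered as decimal strings, which makes a convenient check against a structured storage browser.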
204 - installsite.org
205 [Multi-Language MSI Packages without Setup.exe Launcher] (http://www.installsite.org/pages/en/msi/articles/embeddedlang/)
206 The original page that documents the automatic application of embedded language transforms.
207 First written by Andreas Kerl at Microsoft Germany and then translated into English.
208 - MSDN
209 [Template Summary property] (http://msdn.microsoft.com/en-us/library/windows/desktop/aa372070%28v=vs.85%29.aspx)
210 Documents the format of the Template property in the SIS.
211 From this page: "Merge Modules are the only packages that may have multiple languages."
212 Contrary to this quotation, MSI files are also allowed to have multiple languages, although such support is undocumented.
213 - MSDN
214 [Embedded Transforms] (http://msdn.microsoft.com/en-us/library/windows/desktop/aa368347%28v=vs.85%29.aspx)
215 This page not only isn't very informative, it's also slightly wrong.
216 There's an implication that an embedded transform is stored as a file, which is only sort-of right;
217 the actual mechanism is as a substorage (directory).
218
219 */