Rietveld Code Review Tool

Unified Diff: installer/src/documentation/localization.dox

Issue 5394579117309952: Installer localization and new build system (Closed)
Patch Set: Created Dec. 10, 2013, 4:11 a.m.
Index: installer/src/documentation/localization.dox
===================================================================
new file mode 100644
--- /dev/null
+++ b/installer/src/documentation/localization.dox
@@ -0,0 +1,219 @@
+/*!
+
+\page localization Localization
+
+The Windows Installer originally dates from 1999, and it hasn't kept up with the times.
+Most problematic is that as of this writing [2013 Nov] it doesn't have official support for Unicode.
+Support is largely undocumented, and in any case is only partial.
+This causes all sorts of problems with modern tools, many of which default to UTF-8.
+WiX, as XML, does not exactly default to UTF-8, but it's close to it by convention.
+Some languages can be written only in Unicode because no code page covers them, notably those based on Indic scripts.
+
+Overall, these limitations generate two kinds of issues.
+The first kind is deficits in the shipped product.
+In short, some strings in certain languages simply cannot be localized.
+The second kind is a burden on compliance testing before release.
+Unicode support is not official in Windows Installer; indeed it's only partial support at that.
+But also the WiX toolset does not provide a complete set of assurance tools,
+ even though all the necessary compiler mechanisms seem to be present.
+As a result, manual verification of localization must occur for each language.
+
+\par MSI format
+
+The format of Windows Installer affects these issues.
+While it's a proprietary format, it's not totally opaque.
+Overall, it uses the [COM Structured Storage][structured_storage] format,
+ which is just an archive file with an internal file system.
+In their lingo, a "storage" is a directory and a "stream" is a file.
+Two important aspects of localization use mechanisms whose documentation uses these terms.
+The "Summary Information Stream" is a file that provides metadata for the installer database.
+Embedded language transforms sit in the final MSI as storages.
+Some commentary on this topic is available on Rob Mensching's blog;
+ he's the principal author of WiX and a former member of the Windows Installer team.
+- Mensching
+ [Inside the MSI file format] (http://robmensching.com/blog/posts/2003/11/25/inside-the-msi-file-format)
+- Mensching
+ [Inside the MSI file format, again] (http://robmensching.com/blog/posts/2004/2/10/inside-the-msi-file-format-again)
+
+[structured_storage]: http://msdn.microsoft.com/en-us/library/aa380369%28VS.85%29.aspx
+
+The Summary Information Stream is an important structure for two reasons.
+The first is that it needs to be localized.
+It contains the strings that appear in the Control Panel tool that lists all installed programs.
+This includes basic things like the name of the program.
+The second is that one of its fields lists the languages supported in a multiple-language installer.
+- MSDN
+ [Summary Property Descriptions] (http://msdn.microsoft.com/en-us/library/windows/desktop/aa372049%28v=vs.85%29.aspx)
+
+It's worth noting that these summary properties were not originally designed for installation.
+Rather, they are derived from the summary information for documents defined by the COM structured storage interface.
+The names originally used for these fields were carried over to the SIS for installers,
+ even when they didn't fit at all.
+One example is the use of the page count field to represent the minimum required installer version.
+Likewise, certain fields were dropped entirely, such as "total editing time".
+- MSDN
+ [Summary Information Property Set] (http://msdn.microsoft.com/en-us/library/windows/desktop/aa380376%28v=vs.85%29.aspx)
+ This page is about COM Structured Storage, not Windows Installer.
+
+OpenMCDF, available on SourceForge, is a .NET component for manipulating structured storage.
+It contains a simple structured storage browser that can show if embedded transforms are present.
+- SourceForge
+ [OpenMCDF] (http://sourceforge.net/projects/openmcdf/)
+
+
+\par Installation Strings
+
+A simple MSI contains two internal files that contain localizable strings.
+- The installation database itself.
+ This is a set of tables, some of which contain properties and UI strings.
+- The Summary Information Stream.
+ This data appears principally in the Add/Remove Programs Menu within the Control Panel.
+We're also using embedded directories (storages) to store language-specific transforms.
+These transforms, however, are generated from simple MSIs that contain only the two files above.
+
+These two files are localized differently and have different levels of support for Unicode.
+In short, the installer database mostly supports Unicode and the SIS does not.
+Unicode is not officially supported anywhere, but it seems to work in practice for installer databases.
+The SIS is another story.
+Trying to use Unicode in the SIS leads to Windows thinking the MSI is malformed and refusing to install it.
+The SIS can be localized into many languages, but not all; see the section on code pages below.
+
+
+\par Language Identifiers
+
+Microsoft has its own system of identifying languages.
+(Could it have been otherwise?)
+A language is identified with an LCID, which variously stands for language code identifier or locale identifier.
+The inconsistency is internal to Microsoft, which does not itself use the term LCID consistently.
+The earlier LCID is the one accepted by Windows system calls;
+ the later one is used for the .NET library.
+They are mostly the same, but they're not identical.
+
+Windows Installer does not itself use any of the textual identifiers that are available,
+ such as ISO 639 or its IETF usage.
+WiX has some support for textual identifiers, but doesn't document how LCIDs are generated from them.
+As a result, WiX should be considered unreliable for this.
+LCID reference tables are usually specified as hexadecimal constants,
+ since the LCID itself is a bit field.
+Infuriatingly, WiX doesn't support hexadecimal integer constants,
+ requiring a conversion of the LCID to decimal.
+
+The notion of a character set is given by a code page,
+ which is simply an assignment of byte strings to character strings.
+It's not as simple as a byte-to-character map,
+ since a double-byte character set (DBCS) uses an internal escape code
+ and mixes single- and double-byte characters.
+Regardless of these details, the important distinction is between the so-called "ANSI code pages" and Unicode.
+
+- MSDN
+ [Language Identifiers] (http://msdn.microsoft.com/en-us/library/dd318691%28v=vs.85%29.aspx)
+ The format of LCID as the operating system recognizes them.
+- MSDN
+ [Language Identifier Constants and Strings] (http://msdn.microsoft.com/en-us/library/dd318693%28v=vs.85%29.aspx).
+ Note: This is for the .NET version, but contains enough information to derive the Windows system call version.
+
+
+\par Code Pages
+
+The SIS is the only place where code pages really matter, because the SIS doesn't support Unicode.
+
+Microsoft has its own system of character sets, called code pages.
+Code pages predate Unicode, and while Microsoft is shifting over to Unicode, the transition isn't complete.
+As long as the Windows Installer doesn't fully support Unicode, code pages will be relevant.
+Even afterwards, they'll remain relevant until the first version with working Unicode support is the oldest one still requiring support.
+
+A code page specifies an encoding from byte strings to character strings.
+The details aren't particularly relevant to installation issues.
+What matters are the identifiers.
+With Unicode, you can settle on a single identifier, "UTF-8", that represents an encoding.
+Code page identifiers are 16-bit integers.
+The most common one is 1252 (Windows Latin 1), used for English and most Western European languages.
+The code page identifier 65001 means UTF-8, but it's not universally supported.
+
+Even within the Microsoft environment, code page identifiers are not completely consistent.
+The Windows operating system accepts a basic set of code pages;
+ overall these are alphabetic writing systems.
+There are also OEM code pages and non-native code pages.
+Code pages are used by Windows for font selection (amongst other things).
+Either or both of the code page itself and the requisite display fonts might be absent on a machine.
+Furthermore, .NET introduced some others, such as one for UTF-16LE,
+ but these are only available to managed applications.
+Luckily, there don't seem to be any multiply-assigned code page identifiers.
+All this matters because various tables of code page identifiers tend not to distinguish clearly between these categories.
+Even armed with a table, the developer or a linguist should be cautious about selecting a code page.
+
+The SIS uses, however, so-called "ANSI code pages", sometimes called "Windows code pages".
+This doesn't seem to be a particularly well-defined term.
+It seems to mean "acceptable to a system call ending with A and not W",
+ but even that's not clear.
+What is clear is that even though 65001 for UTF-8 appears in some tables of ANSI code pages,
+ it isn't one insofar as Windows Installer is concerned.
+Caveat emptor, indeed.
+
+- MSDN
+ [Code Pages] (http://msdn.microsoft.com/en-us/library/windows/desktop/dd317752%28v=vs.85%29.aspx)
+ Commentary and historical information.
+- Wikipedia
+ [Code page] (http://en.wikipedia.org/wiki/Code_page)
+ Do not take as authoritative, but nevertheless useful for understanding some things you might see elsewhere.
+- MSDN
+ [Code Page Identifiers] (http://msdn.microsoft.com/en-us/library/dd317756%28VS.85%29.aspx)
+- Heath Stewart's blog
+ [MSI Databases and Code Pages] (http://blogs.msdn.com/b/heaths/archive/2005/10/05/msi-databases-and-code-pages.aspx)
+ Explains how older tools dealt with this issue; WiX is much easier.
+ Contains some useful background information.
+- Mailing list wix-users
+ [Build time selection of codepage...] (http://comments.gmane.org/gmane.comp.windows.devel.wix.user/44388)
+ One developer decided not to localize the SIS at all.
+- [Character Sets And Code Pages At The Push Of A Button] (http://www.i18nguy.com/unicode/codepages.html)
+ A wealth of information about the specific code page and other encodings.
+ Hopefully we never need to dig this far in.
+
+
+\par Multiple-language Installers
+
+The Windows Installer does not have good mechanisms for multiple languages.
+Each installation package is a single-language installer at the time of installation.
+In some situations it's feasible to simply ship out single-language installers,
+ such as internal deployment within a multinational company.
+In others, where multiple languages are required, this leads to bloat because of duplication of installation assets.
+
+The documented way to support this situation is to deliver two assets:
+ (1) a basic MSI and (2) a transform (MST) that changes the language of the MSI.
+There's syntax on the `msiexec` command line to apply a transform before installation.
+For consumer software, this is too much, so people started using an installation driver executable.
+This executable would typically present a menu for the locale,
+ or sometimes infer it from the environment.
+After selecting a locale, it would select and apply an appropriate transform.
+This is the documented way of doing things.
+
+There is, however, an undocumented way to automatically apply a transform.
+The Summary Information has a property called `Template` that contains a list of languages supported.
+These languages are specified by a language identifier (LCID), a decimal integer,
+ whose sublanguage may either be generic (zero) or specific.
+For example, 1033 is US English.
+The first (or only) language on the list specifies the language of the non-transformed installation database.
+Each subsequent identifier specifies an additional language.
+The MSI must contain an embedded transform for each additional language.
+(If it doesn't, Windows Installer throws up an error message and aborts installation.)
+An embedded transform is an MST file in a substorage (directory) whose name is the decimal LCID.
+Windows Installer checks the language list before installation proper begins to determine the most appropriate language.
+If it finds one that's not first on the list, it applies its embedded transform before installation proper begins.
+Otherwise it uses the installer database as-is, without any transformation.
+
+- installsite.org
+ [Multi-Language MSI Packages without Setup.exe Launcher] (http://www.installsite.org/pages/en/msi/articles/embeddedlang/)
+ The original page that documents the automatic application of embedded language transforms.
+ First written by Andreas Kerl at Microsoft Germany and then translated into English.
+- MSDN
+ [Template Summary property] (http://msdn.microsoft.com/en-us/library/windows/desktop/aa372070%28v=vs.85%29.aspx)
+ Documents the format of the Template property in the SIS.
+ From this page: "Merge Modules are the only packages that may have multiple languages."
+ Contrary to this quotation, MSI files are also allowed to have multiple languages, although such support is undocumented.
+- MSDN
+ [Embedded Transforms] (http://msdn.microsoft.com/en-us/library/windows/desktop/aa368347%28v=vs.85%29.aspx)
+ This page not only isn't very informative, it's also slightly wrong.
+ There's an implication that an embedded transform is stored as a file, which is only sort-of right;
+ the actual mechanism is as a substorage (directory).
+
+*/
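
Note on the "Language Identifiers" and "Multiple-language Installers" sections of the patch above: the LCID handling they describe can be illustrated with a short Python sketch. It is kept outside the diff so the hunk stays intact. The sketch converts a hexadecimal LCID to the decimal form WiX requires, splits an identifier into its primary language and sublanguage, parses a `Template` value of the form `platform;lcid,lcid,...`, and picks the embedded transform to apply. All function names are hypothetical. The matching rule in `choose_transform` (exact match first, then a generic entry with sublanguage zero, then the first language) is only an assumption about behavior the documentation itself calls undocumented; it is not Windows Installer's actual algorithm.

```python
def hex_lcid_to_decimal(hex_text):
    """Convert a hexadecimal LCID such as '0x0409' to decimal text."""
    return str(int(hex_text, 16))


def split_langid(langid):
    """Split a 16-bit language identifier into (primary, sublanguage).

    The primary language occupies the low 10 bits and the sublanguage
    the next 6 bits. A sublanguage of zero is the generic form.
    """
    return langid & 0x3FF, (langid >> 10) & 0x3F


def parse_template(template):
    """Parse a Template value of the form 'platform;lcid,lcid,...'."""
    platform, _, language_text = template.partition(";")
    languages = [int(item) for item in language_text.split(",") if item.strip()]
    return platform, languages


def choose_transform(languages, user_langid):
    """Return the substorage name of the transform to apply, or None.

    ASSUMPTION: this matching rule is illustrative only. It prefers an
    exact match, then a generic entry (sublanguage zero) with the same
    primary language, and otherwise falls back to the first language.
    None means the database is used as-is, without a transform.
    """
    if not languages:
        return None
    user_primary, _ = split_langid(user_langid)
    chosen = languages[0]
    if user_langid in languages:
        chosen = user_langid
    else:
        for candidate in languages:
            primary, sublanguage = split_langid(candidate)
            if sublanguage == 0 and primary == user_primary:
                chosen = candidate
                break
    if chosen == languages[0]:
        return None
    # An embedded transform lives in a substorage named by the decimal LCID.
    return str(chosen)
```

For example, `parse_template("Intel;1033,1031")` yields the platform `Intel` and the list `[1033, 1031]`. Under the assumed rule, a user whose language identifier is 1031 would get the transform stored in the substorage named `1031`, while a user with 1033 would get the untransformed database.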