How to set up a multi-language, multi-font, multi-format publishing project
IGP:Digital Publisher can be used for the management and publishing of multiple-language editions of a document in multiple formats. It has been used to create monthly magazines in 206 languages delivered in grayscale print PDF, full colour Online PDFs, and all major e-book formats.
There are a number of issues that must be considered when planning a multi-language, multi-format document production project. These are:
All multi-language publishing that targets the Internet must use UTF-8 encoding. This simplifies production and support. It does mean larger files for text made up mostly of characters at higher Unicode code points, but this is a small price to pay for simpler production and distribution. For example, Chinese content will typically be approximately 30% larger in UTF-8 than in UTF-16.
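The size trade-off can be checked on real content before a project starts. The following is a minimal sketch, assuming short hypothetical sample strings; the helper name `encoded_sizes` is illustrative only. The actual overhead for any document depends on how much ASCII markup surrounds the text.

```python
def encoded_sizes(text):
    """Return the byte length of text in UTF-8 and UTF-16 (no byte-order mark)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),  # "-le" avoids the 2-byte BOM
    }


# Hypothetical samples: bare Chinese text, and the same text inside ASCII markup.
plain = "\u591a\u8bed\u8a00\u51fa\u7248"
marked_up = '<p class="body">' + plain + "</p>"

for label, sample in (("plain", plain), ("marked up", marked_up)):
    sizes = encoded_sizes(sample)
    ratio = sizes["utf-8"] / sizes["utf-16"]
    print(f"{label}: UTF-8 {sizes['utf-8']} bytes, "
          f"UTF-16 {sizes['utf-16']} bytes, ratio {ratio:.2f}")
```

Running the same comparison over a representative exported file gives a more useful figure than any rule of thumb, because markup, Latin text and punctuation all take a single byte in UTF-8.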
The advantage of the UTF-8 approach is that any character from any language can be placed side by side with no encoding confusion.
Here is an example with characters from many different languages sitting comfortably side-by-side. As long as the characters have glyphs in the fonts they will display.
A Å Ă Ƣ Λ Й אḀ Ἂ Ⱡ Ꙍ Ꜻ
This set of examples covers Basic Latin, Latin-1 Supplement, Latin Extended-A, B, C and D, Latin Extended Additional, Greek, Greek Extended, Cyrillic, Cyrillic Extended-B and Hebrew.
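A quick way to see which scripts a piece of text draws on is to look up each character's Unicode name. The sketch below uses Python's standard `unicodedata` module and treats the first word of the character name as the script, which is a rough heuristic and an assumption of this example, not a formal script property lookup. The function names are illustrative.

```python
import unicodedata

# The sample characters from the line above, written as escapes for safety.
SAMPLE = ("\u0041\u00c5\u0102\u01a2\u039b\u0419"
          "\u05d0\u1e00\u1f0a\u2c60\ua64c\ua73a")


def script_of(ch):
    """Heuristic: the first word of the Unicode name (LATIN, GREEK, ...)."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def group_by_script(text):
    """Map each heuristic script name to the characters that use it."""
    groups = {}
    for ch in text:
        if ch.isspace():
            continue
        groups.setdefault(script_of(ch), []).append(ch)
    return groups


for script, chars in sorted(group_by_script(SAMPLE).items()):
    code_points = ", ".join(f"U+{ord(c):04X}" for c in chars)
    print(f"{script}: {code_points}")
```

A grouping like this can help decide how many font schemes a document actually needs, since one scheme per script is usually enough.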
Font schemes should be organized by script and not language. Remember that many languages use the same scripts. That means many of the characters in Unicode are used by many languages.
The strongest example is Latin, which covers hundreds of languages. It is completely incorrect to regard this script as English. Many fonts do not support the full range of characters required by all Latin-script languages. Commercial fonts are the most guilty here and are usually very weak in Unicode Latin Extended-A and Latin Extended-B.
Another example is Indic Devanagari. This is a script, NOT a language, and is used by Hindi, Marathi, Nepali and a number of dialect languages.
It may be necessary to have dedicated fonts for Eastern European languages, Vietnamese and other diacritic-heavy languages, which have been very badly served by anything other than Open Source fonts. In this case the Font Scheme can be most clearly named after the language.
A very good resource for understanding and planning the relationship between languages and scripts can be found here:
http://www.unicode.org/cldr/charts/latest/supplemental/languages_and_scripts.html
Language codes are defined by the ISO 639 specification, which is large and confusing. Browsers only take notice of ISO 639-1, the two-letter codes. There are 186 language codes in this list at the time of writing.
In HTML(5) the lang="xx" attribute must be used. It is a global attribute, which means it can be used on any element. The dir="rtl" direction attribute, needed for right-to-left scripts such as Hebrew and Arabic, is also global.
In addition, both lang and dir are fully inherited. A lang attribute at the top of a page applies to all elements on the page unless there is an element with a different lang value.
When packaging for ePub2/3, which is XHTML, xml:lang="xx" must be used, together with dir="rtl" for right-to-left content. This is processed automatically by IGP:Formats On Demand at packaging time.
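The inheritance rule and the XHTML requirement can be illustrated with a short script. This is a minimal sketch using Python's standard `xml.etree.ElementTree`; it is not how IGP:Formats On Demand performs the step, and the function names `effective_langs` and `mirror_lang` are hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree's name for the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_langs(element, inherited=None):
    """Yield (element, language) pairs, applying lang inheritance."""
    lang = element.get("lang", inherited)
    yield element, lang
    for child in element:
        yield from effective_langs(child, lang)


def mirror_lang(root):
    """Copy every explicit lang attribute to xml:lang for XHTML output."""
    for element in root.iter():
        lang = element.get("lang")
        if lang is not None and element.get(XML_LANG) is None:
            element.set(XML_LANG, lang)
    return root


root = ET.fromstring('<div lang="en"><p>Hello</p><p lang="de">Hallo</p></div>')

# The first paragraph inherits "en"; the second overrides it with "de".
for element, lang in effective_langs(root):
    print(element.tag, lang)

print(ET.tostring(mirror_lang(root), encoding="unicode"))
```

Only elements that carry an explicit lang attribute receive xml:lang here; inherited values do not need to be written out, because xml:lang is inherited in the same way.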
Please note that when using UTF-8 encoding the value of the language attribute lies primarily in correct font rendering (especially with OTF font features) and hyphenation. If a reading device does not support hyphenation for a language, the attribute serves little purpose on that device. Having said that, getting language attributes correct is essential for future-proofing content produced today.
Wikipedia Article http://en.wikipedia.org/wiki/ISO_639
If the requirement is for a fixed number of pages for an article, story or section, it is important to understand the source-language word count and characters per word (including white space), and how that length relates to other languages for any particular subject.
When the objective is highly automated content production for print in many languages, reliable PDF layout and generation is essential. Typeset document length reliability starts with editorial length control.
Generally there is a relatively consistent character count per page or column on fully set pages. If the target is, for example, two pages per article with images and sidebars, the number of words and the characters per word must be very consistent.
For example, if an English source document fills the available content flow space with 400 words at 5.6 characters per word (about 2,240 characters), this word and character count must be editorially controlled to within ±20 or so characters. Some testing will be needed to find the absolute upper word-count limit.
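An editorial length check of this kind is easy to automate. The sketch below assumes the characters-per-word figure includes spaces, as in the table later in this section, and uses the ±20 tolerance from the example; the function names are illustrative.

```python
def character_budget(words, chars_per_word):
    """Target character count for a content flow, spaces included."""
    return round(words * chars_per_word)


def within_budget(text, budget, tolerance=20):
    """True when the text length is within the tolerance of the budget."""
    return abs(len(text) - budget) <= tolerance


budget = character_budget(400, 5.6)
print(budget)  # 2240

# Hypothetical draft of 2,250 characters: inside the +/-20 tolerance.
draft = "x" * 2250
print(within_budget(draft, budget))
```

The tolerance itself is something to establish by testing against the real page design, since images and sidebars change the available flow space.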
If there are no page-count limits or rules other than the standard typesetting widow/orphan and minimum section-turn line counts, this is less important.
The following table shows the same content, sourced in English and translated to German, Portuguese and Russian. The sample was 3,259 article words. The letter count includes spaces.
Language | Words | Chars | Ch/Wd | Words vs English | Chars vs English |
---|---|---|---|---|---|
English | 3,259 | 19,287 | 5.9 | | |
German | 3,861 | 23,448 | 6.1 | 16% | 18% |
Portuguese | 3,128 | 19,526 | 6.2 | -4% | 1% |
Russian | 3,047 | 18,505 | 6.1 | -7% | -4% |
Word count indicates the number of whitespace line-break opportunities available without hyphenation. Taken together with characters per word, it also indicates whether a language has longer words.
Rather obviously, German is significantly longer than English in both word count and word length. Portuguese has only a small character increase with notably fewer words, indicating more long words. Russian is notably more economical than English in both characters and words, but its words are still longer.
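The same measurements can be produced for any source and translation pair. This is a minimal sketch, assuming plain text input with spaces counted as characters. The percentage differences in the table are reproduced when the difference is expressed as a share of the translated figure; that is an inference from the numbers, not a documented formula.

```python
def text_stats(text):
    """Return (words, characters including spaces, characters per word)."""
    words = len(text.split())
    chars = len(text)
    return words, chars, round(chars / words, 1)


def pct_diff(target, source):
    """Difference from the source, as a percentage of the target figure."""
    return round((target - source) / target * 100)


# Figures from the table: (words, chars) per language.
english = (3259, 19287)
translations = {
    "German": (3861, 23448),
    "Portuguese": (3128, 19526),
    "Russian": (3047, 18505),
}

for language, (words, chars) in translations.items():
    print(language,
          f"words {pct_diff(words, english[0])}%",
          f"chars {pct_diff(chars, english[1])}%")
```

Running `text_stats` over each translated manuscript at import time flags the languages that will need a different typesetting specification before any pages are set.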
This objective analysis of a language's text length considerably speeds up typesetting decisions in an automated typesetting environment. With this example the same typesetting specification can be used for English, Portuguese and Russian. Applying cautious hyphenation to Portuguese may also be appropriate given its high average characters per word; a 4/3 setting would seem to be indicated.
German will ideally be hyphenated with a long character setting such as 4/3, a slightly wider gutter, and a smaller font-face. Because it is nearly 20% longer, 100 lines could become between 110 and 120 lines in German. Line-height and line count may also have to be adjusted as a last resort.
These instructions apply to both translation source language and target language documents.
It is essential that editing is carried out using fully Unicode-compliant fonts and without combining diacritics. Combining diacritics will work in PDF creation but will definitely fail in e-book reading systems, where the rendering engine's typography is not as advanced.
Generally this is addressed by using the standard Microsoft fonts or Open Source fonts. Commercial fonts and their derivatives are a dangerous minefield with combining diacritics. This is especially true of older TTF fonts.
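Combining diacritics can be detected, and in most cases removed, before manuscript import. The sketch below uses Unicode NFC normalization from Python's standard `unicodedata` module to replace base-plus-mark sequences with precomposed characters, then reports any marks that remain. It is an illustration only, and is meant for scripts such as Latin, Greek and Cyrillic; scripts such as Devanagari use combining marks legitimately and will always report them.

```python
import unicodedata


def compose(text):
    """Replace base + combining mark sequences with precomposed characters."""
    return unicodedata.normalize("NFC", text)


def leftover_marks(text):
    """List (index, name) for marks that have no precomposed equivalent."""
    composed = compose(text)
    return [
        (index, unicodedata.name(ch, "UNKNOWN"))
        for index, ch in enumerate(composed)
        if unicodedata.category(ch).startswith("M")
    ]


decomposed = "A\u030a"            # A followed by COMBINING RING ABOVE
print(compose(decomposed) == "\u00c5")   # True: becomes the single character U+00C5
print(leftover_marks("q\u0301"))         # q with acute has no precomposed form
```

Anything reported by `leftover_marks` needs an editorial decision, because no single-character replacement exists for it.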
It is equally important to ensure that the spaces between words are single soft spaces and not combinations of hard (non-breaking) and soft spaces. Hard spaces are preserved on manuscript import and result in unsatisfactory word spacing.
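Word spacing can be cleaned up in the source text before import. This is a minimal sketch, assuming that every hard space in the manuscript is unwanted; the list of characters treated as hard spaces is an assumption of this example. Where a language's typography relies on non-breaking spaces, for instance before some French punctuation, a blanket replacement like this would not be appropriate.

```python
import re

# Assumed set of "hard" spaces: no-break, figure and narrow no-break space.
HARD_SPACES = "\u00a0\u2007\u202f"


def normalize_spaces(text):
    """Turn hard spaces into ordinary spaces, then collapse runs of spaces."""
    text = text.translate({ord(ch): " " for ch in HARD_SPACES})
    return re.sub(r" {2,}", " ", text)


sample = "multi\u00a0language  \u00a0text"
print(normalize_spaces(sample))  # multi language text
```

Line breaks and tabs are left untouched, so paragraph structure survives the clean-up.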
NOTE: If text is copied directly from typesetting applications there will inevitably be word joins and combining diacritics. These applications promote laziness in this operation and the content is not suitable for internationalization extraction.
When typesetting any text in any script there is a set of variables that can be applied to optimize presentation, length and line count for any and every language. In order of importance and effect these are:
Font-family. It is important the font-family selected contains all the required characters for a particular language.
Galley-width. Galley and margins are very powerful in fine-tuning the line-count length of a document. They can dramatically affect end-of-line breaks. Obviously, more characters on a line (smaller font or wider galley) give a better non-hyphenated line presentation.
Line-height. The balance between font-size and line-height is critical for the readability of any text. Line-height is listed before font-size because font-size is subordinate to it: line-height and line-count set the baseline grid for any page.
Font-size. Font size has a significant effect on line-count and document length. Font-size changes as small as 0.25pt can have a significant effect depending on paragraphing.
Justification. This is applied for readability and to create a professional-looking document (based on house style perceptions). There are a number of languages that should not be justified because they have very long words and no automatic hyphenation tools. Examples are South Indian languages. Consider using left alignment for these languages or there will be large and ugly spaces between words.
Hyphenation. Hyphenation will generally shorten the line count to paragraph turns. It has maximum effect on long paragraphs but is unlikely to pick up more than a line or two in short text. It is important that the correct lang code has been used and that the rendering system supports hyphenation for that language.
Word Spacing. Light word spacing can make a difference to the application of hyphens. It can be a double-edged sword and create spacing and hyphenation problems as well as solve them.
Letter Spacing. This should be used as a last resort setting unless you have mild letter-spacing reduction as a house policy. However that is generally done as a reading quality issue rather than content shortening.