
Transliteration

User's Guide
http://sites.google.com/site/bhashaime/
Send feedback and feature requests to
bhashaime@gmail.com
Author : Venkatesh

Table of Contents

1 Introduction
2 Usage
3 Text Format preservation (for RTF source text)
4 Text with Multiple fonts
5 Single-click conversion of all fonts
6 Batch conversion (Windows Explorer & Left-Ctl + Menu-Invoke)
7 User configurable parameters
8 Preferred Word processors
9 Preferred PDF readers for text extraction
10 Essential Tweaks
10.1 To Force specific transliteration scheme (Left-Shift & Menu Invoke)
10.2 Extract Fonts from PDFs (strictly for understanding font encoding)
10.3 Speed-up IME start-up

Introduction

The IME has a built-in transliteration engine for converting non-Unicode text to Unicode text.
Although a few Unicode to non-Unicode conversions are also provided, the engine is primarily
designed for the former. The conversion rules for each encoding are provided in separate T_xxx.txt
files under the T_ folder.

The format of T_ files is provided below. Users can create or modify T_ files to support fonts not
already handled by the IME or to customize the existing rules.

The IME can handle and convert plain text (e.g. text typed in Notepad, Notepad++ etc.) and RTF (Rich
Text Format) text (e.g. text created with WPS Writer, MS Word etc., or text extracted from a
PDF file using Foxit etc.).

Usage

Conversion works through the clipboard. Copy the source text to the clipboard and invoke the
conversion from the tray menu. A pop-up reports progress and completion. After completion, the
converted text is available on the clipboard and can be pasted into any editor.

Note: If the source text is RTF, the converted text is available only as RTF. It can therefore be
pasted only into an RTF-aware editor such as WPS Writer or MS Word. Plain editors like Notepad do
not recognize RTF and cannot paste the text.
If the source text is plain text, the converted text can be pasted into any editor, including Notepad.

Text Format preservation (for RTF source text)

RTF covers all the formatting details supported by MS Office 2007. The IME preserves the RTF
formatting intact. The converted text therefore has exactly the same formatting as the source
text, including font size, style (bold/italic), tables, borders, headers, footers, footnotes, endnotes
etc.
Note: RTF created by MS Word is often unsuitable for conversion, e.g. continuous text-runs are
broken up in the RTF, which leads to incorrect output after conversion. It is preferable to
use WPS Writer in such cases. Alternatively, the source text can be pasted into a plain text editor
first and then copied to the clipboard for conversion; this does away with the RTF, and of course all
formatting is lost.

Text with Multiple fonts

The IME can recognize text with multiple fonts and apply conversions selectively to text-runs in a
specific font. This requires the text to be in RTF. Applications like WPS Office, MS Office,
LibreOffice and OpenOffice can provide text in RTF.
Open font-aware files (e.g. rtf, doc, docx, odt) in a suitable editor, select and copy the text to the
clipboard, and invoke the corresponding conversion menu in turn for each source font used in the file.
Note: Some fonts are available in both TTF and PFB formats, which use different encodings but the
same font name, e.g. SHREE-DEV-0708.ttf and SHREE-DEV-0708.pfb. The IME has no way to tell TTF from
PFB. Using one conversion for the other results in unintelligible text. When this happens, try the
alternate conversion.

Single-click conversion of all fonts

The IME provides a menu, All (Non-Unicode) Unicode, for converting multiple fonts in one go.
Invoke it with RTF text on the clipboard. The IME recognizes all fonts in the text and invokes the
respective conversions. This is a fast and easy way of handling multiple-font conversion.
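
To check which fonts a piece of RTF declares before converting it, the font table can be inspected directly. The following is a minimal Python sketch, not the IME's own implementation. The function name font_table and the sample RTF string are hypothetical, and the sketch assumes a simple \fonttbl group whose entries contain no nested sub-groups (MS Word sometimes writes such sub-groups, and those entries would be skipped).

```python
import re

def font_table(rtf):
    """Return {font_number: font_name} read from the \\fonttbl group of an RTF string."""
    start = rtf.find("{\\fonttbl")
    if start < 0:
        return {}  # no font table: probably not RTF
    # Walk forward to the brace that closes the \fonttbl group.
    depth = 0
    end = len(rtf) - 1
    for i in range(start, len(rtf)):
        if rtf[i] == "{":
            depth += 1
        elif rtf[i] == "}":
            depth -= 1
            if depth == 0:
                end = i
                break
    table = rtf[start:end + 1]
    fonts = {}
    # Each simple entry looks like {\f0\fnil\fcharset0 Font Name;}
    for number, name in re.findall(r"\\f(\d+)[^{};]*? ([^\\{};]+);", table):
        fonts[int(number)] = name.strip()
    return fonts

# Hypothetical sample using font names mentioned in this guide.
sample = r"{\rtf1\ansi{\fonttbl{\f0\fnil\fcharset0 Narad-Normal;}{\f1\fswiss Shiva;}}\f0 Jh gfj%}"
print(font_table(sample))  # {0: 'Narad-Normal', 1: 'Shiva'}
```

Each font listed this way needs its own conversion rule, which is what the single-click menu selects automatically.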

Note: With this menu, the choice of conversion cannot be finely controlled; e.g. the IME uses
SHREE-DEV-0708(PFB) Uni Dev for converting text in the SHREE-DEV-0708 font because, lexically, it
occurs before SHREE-DEV-0708(TTF) Uni Dev. This cannot simply be switched to SHREE-DEV-0708(TTF)
Uni Dev.
If the latter is desired, move the file T_SHREE-DEV-07(SO7xxxxx.PFB)_2_Udev.txt out of the T_ folder
(say, into a sub-folder temp within T_) and restart the IME. The IME then no longer sees
SHREE-DEV-0708(PFB) Uni Dev and uses SHREE-DEV-0708(TTF) Uni Dev.
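
The lexical-order behaviour described in the note can be modelled in a few lines of Python. This is only an illustration of the reasoning; how the IME actually matches fonts to conversions is not documented here, and the function pick_conversion is hypothetical. The conversion names are written as they appear in this guide.

```python
def pick_conversion(font_name, conversions):
    """Return the lexically first conversion whose name starts with font_name, or None."""
    matches = sorted(c for c in conversions if c.startswith(font_name))
    return matches[0] if matches else None

available = [
    "SHREE-DEV-0708(TTF) Uni Dev",
    "SHREE-DEV-0708(PFB) Uni Dev",
]

# "(PFB)" sorts before "(TTF)", so the PFB conversion wins.
print(pick_conversion("SHREE-DEV-0708", available))  # SHREE-DEV-0708(PFB) Uni Dev

# With the PFB rule set aside, only the TTF conversion remains.
remaining = [c for c in available if "(PFB)" not in c]
print(pick_conversion("SHREE-DEV-0708", remaining))  # SHREE-DEV-0708(TTF) Uni Dev
```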

Batch conversion (Windows Explorer & Left-Ctl + Menu-Invoke)

Open Windows Explorer, multi-select all the files to be converted and, with the Left-Control key
pressed, invoke the conversion menu. Keep the key pressed until a pop-up appears in the system tray,
after which the key can be released. The IME converts all the selected files and creates new files
with a _U suffix. The source files are unchanged.
The IME can convert .txt, .doc and .docx files in batch. .doc and .docx files require WPS Office or
MS Office to be installed. The IME opens these documents in WPS/MS, converts them to RTF, converts
the text, and saves the result as an .rtf file.
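
The naming of batch output files can be sketched as below. This is a guess at the convention, not the IME's code. It assumes that the _U suffix is placed directly before the file extension and that converted .doc/.docx files also carry it; the guide does not spell out either detail. The function name converted_name is hypothetical.

```python
from pathlib import Path

# Word-processor formats that the guide says are saved as .rtf after conversion.
WORD_EXTENSIONS = {".doc", ".docx"}

def converted_name(source):
    """Return the assumed output path for a batch-converted file."""
    p = Path(source)
    ext = ".rtf" if p.suffix.lower() in WORD_EXTENSIONS else p.suffix
    # Assumption: "_U" is inserted between the file stem and the extension.
    return p.with_name(p.stem + "_U" + ext)

print(converted_name("notes.txt").name)   # notes_U.txt
print(converted_name("book.docx").name)   # book_U.rtf
```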
WPS is preferred since it provides the most amenable RTF. The IME automatically detects WPS and MS;
it tries WPS first and falls back to MS on failure.
.txt files containing Unicode characters (Devanagari/IAST/ISO etc.) should be encoded with a
BOM; the IME may not properly recognize text in files encoded without a BOM.
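
A .txt file can be checked for a BOM, and given one if it is missing, with a short Python script. This sketch assumes the files are UTF-8 encoded, which the guide does not state; other Unicode encodings use different BOM bytes. The function names has_utf8_bom and add_utf8_bom are hypothetical.

```python
import codecs
from pathlib import Path

def has_utf8_bom(path):
    """Return True if the file starts with the UTF-8 byte order mark."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8

def add_utf8_bom(path):
    """Prepend a UTF-8 BOM to the file; do nothing if one is already present."""
    p = Path(path)
    data = p.read_bytes()
    if not data.startswith(codecs.BOM_UTF8):
        p.write_bytes(codecs.BOM_UTF8 + data)
```

When writing a new file from Python, opening it with encoding="utf-8-sig" adds the BOM automatically.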

User configurable parameters

There are a few parameters the user can configure to tune the conversion, e.g. selecting
superscript/subscript for Dev to Tamil conversion. These can be set in the User_Config.txt file in the
T_ folder.

Preferred Word processors

WPS Writer best preserves the RTF formatting. For some reason unknown to me, MS Word alters
the font name of some characters/text-runs, which hampers conversion. Opening rtf, doc and docx files
in WPS Writer and copying the text to the clipboard mostly produces the best results.

Preferred PDF readers for text extraction

PDF files with multiple fonts are commonplace. Each text-run in a specific font requires a
different conversion. This kind of text needs to be extracted in a font-preserving format, e.g. RTF.
The free Foxit Reader best preserves font information when extracting PDF text. The copied text is
available as RTF on the clipboard. The IME's transliteration can be invoked right away after copying
text from Foxit Reader.
Rarely, Foxit extraction is imperfect, resulting in incorrect font info and consequently incorrect
conversion. Closing and restarting Foxit Reader should fix the problem.
Rarely, the font info extracted by Foxit is erroneous and is not fixed by a restart. In that case,
open the file in PDF-Xchange Viewer. Select the text in question, right-click, and click Text
Properties. This opens a dialog. Select Formatting under Categories and expand the + under Text
Formatting. This should show the actual font.
Alternatively, open the file in PDF-Xchange Editor, select the text, right-click, choose Copy as a
Rich Text and paste into a word processor. The true font info should be visible in the word processor.
If the PDF text is encoded with a single font and formatting is not important, PDF-Xchange
Viewer/Editor can be used for plain text extraction. AbleWord is another reliable program for plain
text extraction. Apache's PDFBox is another decent tool for plain text extraction.

10 Essential Tweaks
10.1 To Force specific transliteration scheme (Left-Shift & Menu Invoke)
Sometimes the text extracted from a PDF is labelled with font names that differ from the actual
fonts. This makes the converted text incomprehensible.

E.g. Foxit Reader extracts text from Chandralok.pdf (encoded by Rashtriya Samskrita Samsthan)
under the fonts Narad-Normal and Narad-Bold. Converting it by invoking Narad Uni Dev or
ALL (non-Unicode) Unicode produces unintelligible text.

Foxit Extracted Text:

Jh gfj%

Transliterated text:

Knowing that the actual encoding follows the Shiva font, the IME can be forced to use Shiva Uni Dev:
copy the text to the clipboard in Foxit, keep the Left-Shift key pressed and invoke Shiva Uni Dev.
The IME then ignores the font information on the clipboard and forces the Shiva Uni Dev
conversion, producing intelligible output.
Left-Shift + Shiva Uni Dev transliterated text:

10.2 Extract Fonts from PDFs (strictly for understanding font encoding)
MuPDF is an excellent font extractor. It extracts fonts even from PDFs where most other tools, such
as FontForge, Fontmatrix and online tools, fail partially or even completely. It can extract embedded
TTF and CFF fonts.
Fontmatrix is also a good tool for well-formed PDFs. It can extract only embedded TTF fonts.

10.3 Speed-up IME start-up


At start-up, the IME reads all the T_*.txt and K_*.txt files under the T_ and K_ folders
respectively. This takes time. If you are sure you do not need some of the conversions provided by
these files, move the corresponding files into a sub-folder under the T_ and K_ folders respectively.
DO NOT move the TH_*.txt and KH_*.txt files, as they do not, by themselves, delay IME start-up.
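
Setting aside unneeded rule files can be scripted. The following Python sketch is one possible way to do it and is not part of the IME; the function name park_unused_rules and the sub-folder name "unused" are hypothetical. It moves only files whose names start with the given prefix (T_ or K_), so TH_*.txt and KH_*.txt files are never touched.

```python
import shutil
from pathlib import Path

def park_unused_rules(folder, prefix, unused_names, holding_name="unused"):
    """Move the named rule files from folder into a sub-folder; return the names moved."""
    base = Path(folder)
    holding = base / holding_name
    moved = []
    for name in unused_names:
        source = base / name
        # Skip anything that is not a prefix*.txt file in this folder.
        # "TH_x.txt" does not start with "T_", so such files stay in place.
        if not (name.startswith(prefix) and name.endswith(".txt") and source.is_file()):
            continue
        holding.mkdir(exist_ok=True)
        shutil.move(str(source), str(holding / name))
        moved.append(name)
    return moved
```

To restore a conversion later, move its file back from the sub-folder and restart the IME.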
