OmegaT+ Document Filters
Overview
Document filters are responsible for:
- reading an original document in from a file in a specific format (e.g. different filters exist for handling plain text and OpenDocument/OpenOffice);
- extracting translatable content from a document;
- writing the translated document out to a file (replacing translatable content of the original with its translation in the process).
Use realistic settings when configuring filters; otherwise unexpected results may occur. Read the instructions carefully and, if the effect of a change is not understood, keep the default settings or experiment with a test project before using the filters on real work. For more help, ask on the OmegaT+ users group.
Note that document filters can only be configured when no project is open. To change document handling for a particular project, configure the filters before opening the project.
Filter Dialogs
The details of each dialog and of filter usage are given here.
Document Filters
Access to this dialog is gained by selecting Configure Document Filters... from the Settings menu.
Within the dialog a table of the available formats that can be handled is presented. The left column shows the available filter types. The right column shows a checkbox next to each filter type that indicates whether it is enabled (i.e. with a check mark) or not.
Currently available filter types are: Text, Portable Object (PO), Java Resource Bundles, HTML/XHTML, HTML Help Compiler, Key/Value Pairs, DocBook, OpenDocument/OpenOffice.org, Microsoft Open XML, XLIFF, Subtitles.
To edit a filter, highlight it with the mouse and select the Edit... button to open the Edit Document Filters configuration dialog for it.
To reset the filters after changes have been made, select the Defaults button. Select Apply for the reset to take effect or Cancel to back out of it. Be sure that resetting the filters to their defaults is what is wanted before accepting the changes.
Edit Document Filters
Access this from the Document Filters dialog via the Edit... button.
The user is presented with a four-column table for the selected filter type. The table has the headings Original Filename Pattern, Original Encoding, Translation Encoding, and Translation Filename Pattern, which show the configured settings of each filter. These are used for the following purposes:
- Original Filename Pattern - determines which documents in the originals directory (folder) will have the particular filter applied to them. Patterns can be customized according to the user's preferences.
- Original Encoding - the character encoding in effect for the original documents. Used to read in the original documents. Has a drop down list from which to choose encodings.
- Translation Encoding - the character encoding in effect for the translated documents. Used to write out the translated documents. Has a drop down list from which to choose encodings.
- Translation Filename Pattern - determines the pattern used to assign filenames to the translated documents created in the translations directory. Patterns can be customized according to the user's preferences.
On the right side of the dialog are the buttons Add, Edit, and Remove, which open the Add Filter, Edit Filter, and Remove Filter dialogs, respectively.
This dialog also has a Defaults button. It resets the selected filter type to its original set of filters and their default settings. Again, be sure this is exactly what is wanted before using it.
Add Filter & Edit Filter
Access these from the Edit Document Filters dialog via the Add and Edit buttons, respectively.
These dialogs and their instructions are the same, but they are used for slightly different purposes. User-defined filters are created in the Add Filter dialog, which adds the new filter to the list of available filters for the particular filter type. Default and user-defined filters are edited in the Edit Filter dialog, which updates the chosen filter when the changes are applied.
Each dialog contains a number of entries that must be filled in to properly configure the settings of a filter. These correspond to the settings seen in the Edit Document Filters dialog. Select Apply to add a new filter or accept changes to an existing filter, or Cancel to back out of the particular operation.
Original Filename Pattern
Specify the pattern used to determine which documents in the originals directory will have the particular filter applied to them. Patterns can be customized according to the user's preferences, within the limits of normal shell (glob) patterns.
Enter the appropriate original filename pattern in the text area. There is a default pattern set here which depends on the particular filter under consideration. Use the default or set a custom one according to the specific original documents that are to be used with the filter.
Original Encoding
Specify the character encoding of the original documents. The encoding can be determined externally and entered (e.g. it is known ahead of time or reported by OpenOffice.org), implied by the file extension for some preset document formats (e.g. plaintext in Latin-1 uses *.txt1) or, if applicable, left to the <auto> setting. For some formats the ability to change the encoding will be disabled. See the Encodings section on this page for details on the <auto> and disabled settings.
Select the appropriate encoding from the drop down list, or use <auto> to have the encoding used to read the original documents selected automatically.
Translation Encoding
Specify the translated document character encoding. The translated documents will be generated with the selected encoding. It may be a good idea to know the encoding required ahead of time for the particular translation locale. If applicable, the <auto> setting can be used to attempt to automatically set the encoding. Again, for some formats the ability to change the encoding will be disabled. See the Encodings section on this page for details on the <auto> and disabled settings.
Translation Filename Pattern
Specify the pattern used to assign filenames to the translated documents generated in the translations directory. Patterns can be customized according to the user's preferences, within the limits of normal shell (glob) patterns.
Enter the appropriate translated filename pattern in the text area. The default pattern set here depends on the particular filter under consideration. Use the default or set up a custom pattern to alter the filenames of the translated documents for this filter. There are a number of preset filename variables available for this purpose.
Filename Variables
For the Translation Filename Pattern there are a few preset filename variables that can be used to help create the filenames of the translated documents. The syntax of these variables is ${VARIABLE}, which resembles variable references in shell (command line) usage. A variable is placed inline in the Translation Filename Pattern text area and is replaced by a value determined from the documents and settings in a project.
Select a filename variable from the drop down list. Use the Insert button to put a selected variable at the cursor location in the Translation Filename Pattern text area.
Variable | Description |
---|---|
${filename} | original document filename (with extension) |
${nameOnly} | original document filename (without extension) |
${extension} | original document filename extension |
${sourceLanguage} | original language/locale (xx-YY or xx_YY) |
${targetLocale} | translation locale (xx_YY) |
${targetLanguage} | translation language (xx-YY) |
${targetLanguageCode} | translation language code (xx) |
${targetCountryCode} | translation country code (YY) |
${filename} is the default configuration for the Translation Filename Pattern in most cases, which means that the original document filename and translated document filename will be the same.
Remove Filter
This is a confirmation dialog only. Select Apply to permanently remove a filter. User defined filters will not be available after confirming their removal. The default filters for a particular document format will also not be available if removed, but these can be restored from the Edit Document Filters dialog by selecting Defaults.
Encodings
Encodings for the original and the translation are selectable from the drop down lists in the Edit Document Filters, Add Filter, and Edit Filter dialogs. The encodings available are limited to those supported by the Java Runtime Environment in use. In addition, an <auto> setting can be used to attempt to detect the required original document encoding automatically.
<auto> Setting
Depending upon the particular filter in use, one of three actions will be taken when an <auto> setting is encountered:
- OmegaT+ will attempt to determine the encoding automatically; applies to original documents only.
- the default encoding of the operating system will be used.
- no action will occur. This happens when the filter only supports a single original and translation encoding.
Disabled (Grayed Out) Setting
A filter with a disabled encoding setting either does not support multiple document encodings or handles a document format that is encoding-neutral. In either case there is no way to specify another encoding.
Example: Translation Filename Pattern
Perhaps it would be nice to change the translated document filenames to reflect the particular translation language or locale being translated to. One possibility is to add a suffix to the filenames (before the extension). For instance, if the translation language is French, then the addition of the language code might be good.
In this case the pattern could be ${nameOnly}-fr.${extension}, but there is already a variable for the translation language code, so insert that variable instead to give ${nameOnly}-${targetLanguageCode}.${extension}.
Now, if an original document was named test.txt (e.g. under the *.txt text filter) the translated document filename for it would become test-fr.txt.
Note that the filename separator '.' is not inserted by ${extension}; it must be included in the pattern manually if it is wanted.
Continuing, perhaps the country code would be nice also. It could be added with
${nameOnly}-${targetLanguageCode}-CA.${extension}
for the case of French as the translation language and Canada as the country. With this pattern test.txt would become test-fr-CA.txt.
Again, there is already a variable for the translation country code, so insert that variable to give
${nameOnly}-${targetLanguageCode}-${targetCountryCode}.${extension}
More simply, for the case at hand, insert the translation language variable in place of the separate translation language code and translation country code variables:
${nameOnly}-${targetLanguage}.${extension}
The result in the last two cases is the same as before: test-fr-CA.txt. In this example the translated filename now has the language (language code plus country code) added as a suffix before the extension, and the extension is unaltered.
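The substitution described in this example can be sketched in a few lines of Python. This is only an illustration of how the filename variables behave, not the OmegaT+ implementation: the function name expand_pattern, its parameters, and the choice to leave unknown variables untouched are assumptions made for the sketch.

```python
import os
import re

def expand_pattern(pattern, original_filename, source_language, target_language):
    """Expand ${variable} placeholders in a translation filename pattern.

    Hypothetical helper: target_language is assumed to be given as xx-YY.
    """
    name_only, ext = os.path.splitext(original_filename)
    language_code, _, country_code = target_language.partition("-")
    values = {
        "filename": original_filename,
        "nameOnly": name_only,
        "extension": ext[1:],  # drop the leading '.', the pattern supplies it
        "sourceLanguage": source_language,
        "targetLocale": target_language.replace("-", "_"),
        "targetLanguage": target_language,
        "targetLanguageCode": language_code,
        "targetCountryCode": country_code,
    }
    # Assumption of this sketch: unknown variables are left as they are.
    return re.sub(r"\$\{(\w+)\}",
                  lambda m: values.get(m.group(1), m.group(0)),
                  pattern)

for p in ("${filename}",
          "${nameOnly}-${targetLanguageCode}.${extension}",
          "${nameOnly}-${targetLanguage}.${extension}"):
    print(p, "->", expand_pattern(p, "test.txt", "en-US", "fr-CA"))
```

For test.txt with French (Canada) as the translation language, the three patterns give test.txt, test-fr.txt, and test-fr-CA.txt, matching the filenames worked out above.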
Questions & Answers
How is the filter that works with a document determined?
This is determined by the document's filename. Each filter lists the original filename pattern of the documents it can handle, and this pattern can be set by adding or editing a document filter. Each filter may have a distinct pattern associated with it. For example, if the plaintext filter is to be used to handle all documents without an extension, add or edit a filter to have a *. original filename pattern. There are also a few preset filters that work with a specific document type but use different encodings. For example, a Latin-1 encoded plaintext document can be recognized by the .txt1 extension; here the extension is the determining factor for associating documents with a filter. Documents can be renamed to fit a particular pattern before they are used in a project so that they will be recognized automatically.
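As a rough illustration of pattern-based filter selection, the following Python sketch matches filenames against glob patterns using the standard fnmatch module. The filter table, the function name find_filter, and the first-match-wins rule are assumptions made for the example; they do not describe the actual OmegaT+ lookup.

```python
from fnmatch import fnmatchcase

# Hypothetical filter table:
# (original filename pattern, filter type, original encoding)
FILTERS = [
    ("*.txt", "Text", "<auto>"),
    ("*.txt1", "Text", "ISO-8859-1"),
    ("*.po", "Portable Object (PO)", "<auto>"),
]

def find_filter(filename, filters=FILTERS):
    """Return (filter type, encoding) for the first matching pattern, else None."""
    for pattern, filter_type, encoding in filters:
        # The whole filename must match the glob pattern.
        if fnmatchcase(filename, pattern):
            return filter_type, encoding
    return None

print(find_filter("notes.txt"))   # handled by the *.txt entry
print(find_filter("notes.txt1"))  # handled by the Latin-1 *.txt1 entry
print(find_filter("image.png"))   # no entry matches
```

Because the whole filename has to match, notes.txt1 is not picked up by the *.txt entry and falls through to the Latin-1 entry.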
What if a document encoding seems incorrect?
Set up another encoding instead by changing the Original Encoding of the appropriate document filter.
What can be done if a single encoding does not work well for the original and translation languages?
In practice, the UTF-8, UTF-16, and UTF-32 encodings will work in almost all cases, and one of them can be used for both the original and the translation. To use two different encodings for the original and the translation, ensure that an appropriate encoding is chosen for each character set and that the original documents are saved in the proper encoding before opening the project.
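If original documents need to be saved in a different encoding before a project is opened, a text editor can do it, or a small script such as the Python sketch below. The function name convert_encoding and the example filenames are hypothetical, and the source encoding must be known in advance because the script does not detect it.

```python
def convert_encoding(src, dst, from_encoding, to_encoding="utf-8"):
    """Re-save a plain text file in another character encoding."""
    # newline="" keeps the original line endings unchanged.
    with open(src, "r", encoding=from_encoding, newline="") as f:
        text = f.read()
    with open(dst, "w", encoding=to_encoding, newline="") as f:
        f.write(text)

# Example (hypothetical filenames): convert a Latin-1 document to UTF-8.
# convert_encoding("letter.txt1", "letter.txt", "iso-8859-1")
```

Reading fails with a UnicodeDecodeError if the file is not valid in the stated source encoding, which is a useful early warning that the Original Encoding setting would also be wrong.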