Do you feel like taking a sizable bite into useful functionality for WordForge?
These mini-projects are mostly standalone, that is, they do not affect other parts of Pootle or the Translate Toolkit, but they are very useful tools for WordForge. If you would like to hack on one then please contact the team and we will assign the project to you.
Want something smaller? Try looking at our various TODO lists:
Not all of these are small, and some are invasive, but many just need someone with the will and a bit of time. If you need help choosing something then please ask on the mailing lists or on #pootle.
NOTE: please don’t list micro-projects here; rather, point to them from here.
Both XLIFF and TMX can make use of a standard for segmentation. What is segmentation? It is how you break paragraphs of text up into sentences. It differs between languages, and the standard developed by LISA allows you to define segmentation rules for text.
The advantage of segmentation is that it creates better reuse. One text might contain a paragraph with a sentence that is an exact match for a sentence in another piece of text, but if we cannot get down to the sentence level we will never see the match. Segmentation therefore makes our Translation Memory much more usable, and if it is part of XLIFF it makes matching during the actual translation even more useful.
Your job would be to implement the segmentation standard and integrate that into XLIFF and TMX as needed.
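The LISA standard in question is SRX (Segmentation Rules eXchange), which expresses segmentation as ordered break/no-break rules, each with a "before" and "after" regular expression. The sketch below illustrates that rule-matching idea only; the rule list and function names are invented for illustration and are nowhere near the full SRX format:

```python
import re

# Ordered SRX-style rules: (is_break, before_pattern, after_pattern).
# As in SRX, exception (no-break) rules are listed before break rules,
# and the first rule whose patterns match a position wins.
RULES = [
    (False, r"\b(?:Mr|Dr|e\.g|i\.e)\.", r"\s"),  # abbreviations: no break
    (True,  r"[.!?]", r"\s+[A-Z]"),              # sentence end: break
]

def segment(text):
    """Split text into sentences using the ordered rules above."""
    breaks = []
    for pos in range(1, len(text)):
        before, after = text[:pos], text[pos:]
        for is_break, b_pat, a_pat in RULES:
            if re.search(b_pat + r"\Z", before) and re.match(a_pat, after):
                if is_break:
                    breaks.append(pos)
                break  # first matching rule decides; stop checking
    segments, start = [], 0
    for pos in breaks:
        segments.append(text[start:pos].strip())
        start = pos
    segments.append(text[start:].strip())
    return [s for s in segments if s]
```

Note how the no-break rule stops "Dr." from being treated as a sentence end, which is exactly the kind of language-specific knowledge SRX rule files carry.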
You have two pieces of text, the original and the translated text. But you do not have them combined in a bilingual translation file. Perhaps this is old work you inherited from someone else, or you’ve found a source of good translations and you want to be able to use them in your Translation Memory. You might have only the latest source document and an old translation, so you don’t expect them to align completely.
In this case we need an alignment tool. The tool should be able to read the files using our base classes and present the texts side by side, ideally using the segmentation rules to make good initial guesses. The role of the user is to validate the alignment and to adjust it if needed.
The end result should be that all the text items have been aligned or rejected.
The program can then output a new bilingual translation file, e.g. XLIFF or PO, or a TMX Translation Memory file.
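As a starting point, such a tool could pair segments one-to-one by position and attach a crude confidence score, queueing doubtful pairs for the user to validate. This is only a sketch under that naive assumption; a real aligner would use something like the Gale-Church length-based algorithm, and the function names here are invented for illustration:

```python
from itertools import zip_longest

def length_confidence(src, tgt):
    """Crude heuristic: segments of similar length are more likely to be
    translations of each other (a stand-in for real alignment scoring)."""
    if not src or not tgt:
        return 0.0
    return min(len(src), len(tgt)) / max(len(src), len(tgt))

def propose_alignment(src_segments, tgt_segments, threshold=0.5):
    """Pair segments one-to-one by position; pairs scoring below the
    threshold (including leftovers with no partner) are queued for
    manual review by the user."""
    accepted, review = [], []
    for src, tgt in zip_longest(src_segments, tgt_segments, fillvalue=""):
        if length_confidence(src, tgt) >= threshold:
            accepted.append((src, tgt))
        else:
            review.append((src, tgt))
    return accepted, review
```

The accepted pairs would then be written out through the storage base classes as XLIFF, PO or TMX units.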
(note: some initial work has been done on a tool as described here. See poglossary.)
When you translate you should start with a glossary of terms. Most glossary words occur frequently in a body of text, but you might also have frequently occurring phrases that you would want to translate differently from the single words. The glossaries are then used by translators and reviewers to check translations and to ensure consistency.
The glossary extractor tool would look at a number of source files and extract candidate words and phrases. The user would be able to set the extraction parameters, e.g. how many times a term must occur before we extract it, a list of stop words, the maximum phrase length, etc.
The user should be able to eliminate words, check context in the originating text, pull online definitions and link them to the glossary entry, or add their own clarification notes (this might be the role of a separate glossary editor).
The output would be a TBX file, or another file format that can be imported into an application, to populate the translations of the terms.
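The core of the extractor is just n-gram counting with the user-settable parameters described above. A minimal sketch, with an invented function name and a toy stop-word list standing in for the real configurable ones:

```python
import re
from collections import Counter

# Toy stop-word list; in the real tool this would be user-supplied.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it"}

def extract_candidates(texts, min_freq=2, max_phrase_len=3):
    """Count word n-grams (up to max_phrase_len words) across the source
    texts and keep those occurring at least min_freq times.  Phrases that
    start or end with a stop word are discarded as unlikely terms."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        for n in range(1, max_phrase_len + 1):
            for i in range(len(words) - n + 1):
                phrase = words[i:i + n]
                if phrase[0] in STOP_WORDS or phrase[-1] in STOP_WORDS:
                    continue
                counts[" ".join(phrase)] += 1
    return {p: c for p, c in counts.items() if c >= min_freq}
```

Running this over a few UI strings surfaces both single words ("file") and multi-word candidates ("file menu"), which is exactly the word-versus-phrase distinction the project description draws.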
Alternatively, a glossary guesser: use statistical techniques to take an empty glossary file and, using your existing translations, try to guess what might be a glossary word.
The simple case is the single-word entry in translations. The harder case is where the word occurs within a sentence or paragraph.
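One well-known statistical technique for the harder case is co-occurrence scoring across aligned units: a source word and a target word that keep appearing in the same unit pairs are probably translations of each other. The sketch below uses the Dice coefficient as that score; the function name and threshold are invented for illustration, and a real tool would also filter stop words:

```python
from collections import Counter

def guess_glossary(bitext, min_score=0.5):
    """Guess source->target word pairs from a list of
    (source_sentence, target_sentence) pairs, keeping for each source
    word the target word with the highest Dice co-occurrence score."""
    src_counts, tgt_counts, pair_counts = Counter(), Counter(), Counter()
    for src, tgt in bitext:
        src_words = set(src.lower().split())
        tgt_words = set(tgt.lower().split())
        src_counts.update(src_words)
        tgt_counts.update(tgt_words)
        pair_counts.update((s, t) for s in src_words for t in tgt_words)
    best = {}
    for (s, t), n in pair_counts.items():
        # Dice coefficient: 2 * joint count / sum of marginal counts.
        score = 2.0 * n / (src_counts[s] + tgt_counts[t])
        if score >= min_score and score > best.get(s, ("", 0.0))[1]:
            best[s] = (t, score)
    return {s: t for s, (t, _) in best.items()}
```

Even on a handful of sentence pairs this picks out stable pairings, though real data needs much larger counts before the scores mean anything.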
The toolkit provides a framework that allows you to define a storage format (e.g. Gettext PO, .properties, etc.) and a converter to migrate translations between those and the base formats (XLIFF, PO). The following are formats that would be useful to add to the Translate Toolkit. They are in no particular order, but we have limited them to the ones that we regard as most useful.
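The pattern is always the same: a store class that parses the format into units, and a converter that walks those units and writes the base format. This is an illustrative toy only; the toolkit's real base classes live in translate.storage.base, and the class and function names below are simplified stand-ins:

```python
class Unit:
    """One translatable message: a source string and its translation."""
    def __init__(self, source, target=""):
        self.source = source
        self.target = target

class PropertiesStore:
    """Toy parser for Java .properties-style "key=value" lines."""
    def __init__(self):
        self.units = []

    def parse(self, text):
        for line in text.splitlines():
            line = line.strip()
            if line and not line.startswith("#"):
                key, _, value = line.partition("=")
                self.units.append(Unit(key.strip(), value.strip()))

def to_po(store):
    """Serialise the store's units as a PO-style fragment, the direction
    a prop2po-style converter takes (grossly simplified: no escaping,
    headers or comments)."""
    lines = []
    for unit in store.units:
        lines.append('msgid "%s"' % unit.source)
        lines.append('msgstr "%s"' % unit.target)
        lines.append('')
    return "\n".join(lines)
```

Adding a new format to the toolkit is mostly a matter of writing the store class; the converter machinery is largely shared.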
Take your favourite CMS and integrate localisation of the content (not the UI) by using Pootle.
Gettext is the home of the PO format. It would be good if the Gettext tools could also handle XLIFF. These are the areas that would need to be modified to allow full use of XLIFF, in order from most important to least important.
Most wikis, CMSs and general websites DO NOT do proper content negotiation. In this mini-project we are not concerned with the actual content but simply with the interface. It would be nice if, for instance, MediaWiki’s interface defaulted to the user’s preferred language when they view the site. Most of these systems allow people to specify their language when they sign in, but that is not enough.
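The mechanism these systems ignore is the HTTP Accept-Language header, which the browser already sends with every request. A minimal sketch of choosing an interface language from it, assuming a hypothetical negotiate_language helper; full HTTP matching also handles wildcard ranges like "*":

```python
def negotiate_language(accept_header, available, default="en"):
    """Pick the best interface language from an Accept-Language header.
    Tries each language range in descending q-value order, falling back
    from e.g. "pt-BR" to its base language "pt"."""
    choices = []
    for item in accept_header.split(","):
        parts = item.strip().split(";")
        lang = parts[0].strip().lower()
        q = 1.0  # per HTTP, a missing q-value means q=1
        for p in parts[1:]:
            if p.strip().startswith("q="):
                try:
                    q = float(p.strip()[2:])
                except ValueError:
                    q = 0.0
        choices.append((q, lang))
    for q, lang in sorted(choices, reverse=True):
        if lang in available:
            return lang
        base = lang.split("-")[0]
        if base in available:
            return base
    return default
```

With this in place the interface can default sensibly before the user ever signs in, and the signed-in preference then simply overrides it.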
This project would look at a few things, such as: