Category Archives: planetarium

Testing Multilingual Applications – Talk Summary from Wikimania 2014

Its been a while since I managed to write something important on this blog. My blogging efforts these days are mostly concentrated around the day job, which by the way has been extremely enriching. I have had the opportunity to widen my perspective of global communities working with the multilingual digital world. People from diverse cultures come together in the Wikimedia projects to collaborate on social issues like education, digital freedom, tolerance of expression & lifestyles, health, equal opportunities and more. What better place to see this happen than at Wikimania – the annual conference of the Wikimedia movement. The 10th edition of the conference was held this year in London, UK. I was fortunate to participate and also present, along with my team-mate Kartik Mistry. This was our first presentation at a Wikimania.

Since the past few years, I have tried to publish the talking points from my presentations. This was my first major presentation in a long time. Kartik and I presented about the challenges we face everyday when testing applications, that our team creates and maintains for the 300+ languages in Wikimedia projects. We have been working actively to make our testing processes better equipped to handle these challenges, and to blend them into our development work-flow. The slides and the talking points are presented below. I will add the link to the video when its available. Feedback is most welcome.

Talk Abstract

As opposed to traditional testing methodologies an important challenge for testing internationalized applications is to verify the preciseness of the content delivered using them. When we talk about applications developed for different scripts and languages, the key functionalities like display, and input of content may require several levels of verification before it can be signed off as being adequately capable of handling a particular language. A substantial part of this verification process includes visual verification of the text, which requires extensive collaboration between the language speakers and the developers. For content on Wikimedia this can stretch to more than 300 languages for websites that are active or waiting in the incubator. In this session, we would like to present about our current best practices and solutions like tofu-detection – a way to identify if scripts are not being displayed, that can narrow down the long-drawn manual testing stages to address specific problems as they are identified. Talk Submission.

Slides

Talk Summary

Slide 2

As we know, the mission of the Wikimedia Projects is to share the sum of all human knowledge. In the Wikimedia universe we have active projects in over 300 languages, while the multilingual resources have the capability to support more than 400 languages.

For using these languages we use extra tools and resources all the time (sometimes even without our knowledge). But these are not developed as widely as we would like them to be.

You may know them already…

Slide 3

Fonts, input methods, dictionaries, the different resources that are used for spell checking, grammar and everything else that is needed to address the special rules of a language. To make it work in the same way we can use English in most systems.

Slide 4

The applications that we develop to handle multilingual content are tested in the same way other applications are tested. The code sanity, the functionality and everything else that needs to be tested to validate the correctness of the design of the application is tested during the development process. (Kartik described this in some details.)

Slide 5

However, this is one part. The other part combines the language’s requirements to make sure that what gets delivered through the applications is what the language needs.

So the question we are trying to answer as developers is – my code works but does the content look good too?

Slide 6

At this point what becomes important is a visual verification of the content. Are the t-s being crossed and the i-s being dotted but in more complex ways.

Lets see some of the examples here to help explain better what we are trying to say:

  • Example 1 and Example 2 : Fonts entirely missing. Displays Tofu or blocks
  • Example 3: Partially available Text – makes it hard to understand what the User Interface wants you to do
  • Example 4: Input Methods on Visual Editor doesn’t capture the sequence of typed characters
  • Example 5: The otherwise functional braces lose their position when used with RTL text
  • Example 6: Dependent Vowels in complex scripts appear broken with a particular font

Slide 13

There are always more interesting things that keep coming up. The takeaway from this is that, we haven’t yet found a way to escape manual tests when developing applications that are expected to handle multilingual content.

Slide 14

For now, what we have been trying to do is to make it less painful and more organised. Lets go over a checklist that we have been using as a  guideline.

  1. Standard Tests – These are the tests that the developers are doing all the time. Unit tests etc. Its part of the development plans.
  2. Identify must-check items – Once you are through with the standard tests, try to identify the issues and checks that are most important for some languages of a similar type or individual languages. For instance, in languages with complex scripts you may want to check some combinations that should never break.
  3. Note the new and recurring bugs – This list should by no means be rigid. If during tests there are problems that seem to recur or new bugs of major impact surface, add them into your test set of must-checks so that you are aware that these need to be tested again when you make the next release.
  4. Predictable regression tests – The idea is to keep the regression tests organised to some extent so that you don’t miss the really important things.
  5. Ad-hoc testing – However, by no means should the hunt for hidden bugs be stopped. Explore as far as you can. However, you may have to be a little careful because you might find a really ugly bug, and may not remember how to ended up there. So retracing your steps can be a challenge, but that shouldn’t be a major blocker. Once you find it, you can note it down.
  6. Track the results – For precisely this purpose we keep the tests that we regularly want to do in a test tracking system. We use TestLink, where you can organise the tests, the steps that a user can follow and the expected results. Success and failures can be noted and tests can be repeated very easily across releases.
  7. Seek expert help – However, the two most important things to keep in mind is to make sure that you speak to native speakers of the language and maybe to an expert, if you are already a native speaker. There may be situations where your understanding of a language will be challenged. For instance, ancient scripts may have to be tested for projects like WikiSource, and it may even be unfamiliar for regular users of the modern version of the script.
  8. Testing Environments – Secondly, make sure you have stable testing environments in place where people can come and test the applications on their own time

So that’s all we are currently doing to keep things organised. However, we would also like to explore options that can cut down this Herculean effort.

Contact Us


We had a blooper moment, when during the presentation we realised that the screenshot for Example 6 had been accidentally removed. We did not plan for it, but the audience got a glimpse of how manual tests can save the day on more serious occasions.

Planet FLOSS India – 10th anniversary

Its been 10 years since Planet Floss India (PFI) was started by Sayamindu and Sankarshan.

Over the years, many people have been syndicated on this planet and we have thoroughly enjoyed reading about their journey from being individuals enthusiastic about a project to growing as successful professionals, entrepreneurs, and mature contributors with ideas that define much of what the Indian FLOSS-world stands for today.

We wanted to celebrate this occasion with what PFI best stands for – blog posts. We would be super happy if anytime during the next few weeks you could write a post on your syndicated blog about being part of PFI all these years and how you would like to see the project evolve.

Do let us know if your blog has moved and we shall make the changes to get you back on the planet. Thanks for your support and do keep writing.

(Writing on behalf of the PFI team)

PFI is open for syndication all-year-round. So let us know if you would like your blog to be added to the planet.

Building a Standardized Colour Reference Set

Among the various conversations that happened over the last week, one that caught my attention was about having a standardized list of colour for translation glossaries. This has been on my mind for a long time and I have often broached the subject in various talks. Besides the obvious primary hurdle of figuring out how best to create this list, what I consider of more importance is the way such a reference set ought to be presented. For word based terminology, a standard mapping like:

key terms -> translated key terms (with context information, strongly recommended)

is easy to adopt.

However, for colours this becomes difficult to execute, for one very important reason. Colours have names which have been made to sound interesting with cultural or local references, nature (again maybe local or widespread) popular themes or general creativity. This makes it hard to translate. To translate colour names like ‘Salmon’ or ‘Bordeaux’ or <colour-of-sea-caused-by-local-mineral-in-the-water-no-one-outside-an-island-in-the-pacific-has-heard-of> one has to be able to understand what they refer to, which may be hard if one has never come across the fish or the wine or the water. To work around that, I have been using a 2-step method for a while which is probably how everyone does anyways (but never really talks about):

colour-name -> check actual colour -> create new name (unless of course its some basic colours like Red)

So, a natural progression in the direction of standardizing this would involve having the actual colour squeezed in somewhere as the context. Something on the lines of:

Colour Name -> Sample of the of colour -> Colour Translation

like:

Salmon         salmon          ইঁট

It would be good to have something like this set up for not just translation, but general reference.

Translation Sprint for Gaia

Last saturday i.e. 29th December 2012 we had a translation sprint for Ankur India with specific focus on Gaia localization. The last few weeks saw some volunteers introducing themselves to participate in translation and localization. The Firefox OS seemed like a popular project with them that was also easy to translate. However, the reigning confusion with the tool of choice, was not easy to workaround. The new translators were given links to the files they could translate, and send over to the mailing list/mentor for review. Going back and forth in the review process was taking time and we quickly decided on the mailing list, a date to have a translation sprint. We used the IRC channel #ankur.org.in and gathered there from 11 in the morning to 4 in the evening. The initial hour was spent to set up the repository and to decide how we were going to manage the tasks between ourselves. Two of us had commit rights on the Mozilla mercurial repository. Of the 5 translators, two participants were very new to translation work, so it was essential to help them with constant reviews. By the end of the second hour, we were string crunching fast and hard, translators were announcing which modules they were picking (after some initial overlooking of this, prolly due to all the excitement) and then pushing them into the mercurial repository. We shut shop at the closing time, but had a clear process in place which allowed people to continue their work and continue the communication over email. All it needed was an IRC channel and a fundamental understanding of the content translation and delivery cycle.

SUMMARY

Participants:

  • Biraj Karmakar (biraj),
  • Priyanka Nag (priyanka_nag),
  • Runa Bhattacharjee (arrbee/runa_b)
  • Samrat Bhattacharya (samratb),
  • Sayak Sarkar (sayak)

Translation Statistics:

  • At Mozilla Dashboard – 39% translated. (Does not include the files still being reviewed)

What worked:

  • Communication was live
  • Faster turnaround of translation -> reviews -> revision
  • Queries were resolved faster
  • Commits were immediately made into the repository
  • Workflow was established to ensure the committers were being notified of files ready to go into the repository
  • No overlapping of translation

What could have been nicer:

  • A simpler tool to track the translation, through *one* interface. (Discussed many times earlier, and comments can be directed to the earlier post)
  • Pre-decided work assignments to start things off (this was rather hastily put up)
  • More time

Follow up:

  • There is still more to do and the translation has to continue. Not just for Gaia, but for other projects as well.
  • A review session for all the translated content. Besides catching errors and omissions of various nature, this can be of particular benefit to the new translators who can gauge the onscreen context of the content that they had to blindly translate

Mozilla L10n – Discussion Notes

During the recently held Language Summit at Pune, we got an opportunity to discuss about a long standing issue related the localization process. Several discussions over various media have been constantly happening since the past couple of years and yet a clarity on the dynamics were sorely missed. A few months back a generic bug was also filed, which helped collate the points of these dispersed discussion.

Last week, we had Arky from Mozilla with us who helped us get an insight on how things currently stand in the Mozilla Localization front. Old hats like me who have been working on the localization of Mozilla products since a long time (for instance, I had started sometime around Firefox 1.x), had been initiated and trained to use the elaborate method of translation submissions using file comparision in the version control system. During each Firefox release, besides the core component there are also ancilliary components like web-parts that need to be translated on other Version control systems or through bugs. Thankfully there is now the shipping dashboard that lists some of these bits at one url.

However, recently there have been quite a few announcements from various quarters about Mozilla products being made available for translation through several hosts/tools – verbatim, narro, locamotion, even on transifex. Translators could gather files and submit translations via these tools, yet none of them deprecated the earlier method of direct submission into the servers through the Version Control Systems. The matter was much compounded with also a spate of translations coming in from new translators who were being familiarized with translation work at various local camps as part of Mozilla’s community outreach programs.

During the above mentioned session, we sought to find some clarity on this matter and also to understand the future plans that are being undertaken to reconcile the situation. Firstly, we created a list of all the tools and translation processes that are presently active.

 

1. Direct Submission into Mercurial or SVN

2. https://localize.mozilla.org/ – Aka ‘Verbatim‘ is essentially a version of pootle running on a server hosted by Mozilla. Used to translate web-parts, snippets, SUMO content etc.

3. mozilla.locamotion.org – Hosted by the http://translate.sourceforge.net group, and runs an advanced version of pootle. Used to translate Firefox, main.lang etc.

4. Narro – The Narro tool that allows translations of Firefox, components of Thunderbird, Gaia etc

5. Pontoon Project – To localize web content. More details from the developers here.

6. Transifex – Primarily gaia project

7. Babelzilla – Mozilla Plug-ins

There could be more beyond the above.

The reason that was given for the existence of all these tools is to allow translators to choose a tool that they were ‘comfortable with‘. This however gives rise to quite a few complications involving syncing between these tools which evidently provide duplicate platforms for some of the projects and also about maintaining a trace of translations by the translation coordinators. Especially when the direct submission into VCS is still pretty much an open option for translators (coordinators) who may have not be aware of a parallel translation group working on the same project on another translation platform.

A new project called ELMO is aimed at rectifying this situation. This would host the top-level URI of the Mozilla Localization project, with direct links to each Language’s home page. The home page intends to list the Translation team details and urls for the projects. However, there is one big difference that seemed apparent: Unlike other Translation Projects which provide one umbrella translation team, each of the Mozilla products can have different Translation Teams and Coordinators, independent of each other. It may be a scaleable solution for manpower management, but leaves a big chance of product continuity going off-sync in terminology and translation. However, it may be a good idea to wait and see how Elmo fixes the situation.

Meanwhile there were a few action items that were fixed during the discussion (Thanks Arky!), these were:

1. A page on the mozilla wiki listing *all* the translation tools/hosts that are active and the projects that they host

2. Follow up on the discussion bug for “Process Modification”

3. A request to have automatic merging of strings modified in source content into the l10n modules in Mercurial (i.e. the strings identified via compare-locale). For instance, the comparision between the bn-IN and en-US module for the Aurora branch can be found here (cron output).

4. Explore the possibility to identify a consolidated Project calendar for all the Mozilla l10n projects. (Reference comment here)

As Arky mentioned during the discussion, there were plans that were already underway to implemention and I am quite excited to wait and see how things go. Some blogs or updates from the Mozilla L10n administration team would be really helpful and I hope those come in quick and fast.

Attendees:

Arky
Amir Aharoni
Sayak Sarkar
Ankit Gadgil
Ani Peter
Sweta Kothari
Jaswinder Singh
Rajesh Ranjan
Shankar Prasad
Nilamdyuti Goswami
Shantha Kumar
Manoj Giri
Krishnababu Krothrapalli
…(Please leave a comment if I missed your name)

“The Sun Goes Around The Earth”

“THE SUN GOES AROUND THE EARTH”

If one grew up in the city of Kolkata in the 1980s and 90s, they would not be unfamiliar with the above graphiti planted on innumerable walls and lamposts. The graphiti and the adamant proponent of this theory is a legend that a generation would remember.

I was reminded of this, by a rather unfortunate turn of events that happened, on a mailing list of much repute. Just this morning, I was speaking with a colleague about how often and unknowingly we are drawn into stressful situations which make us lose focus from the task at hand. After having responded to a mail thread now crossing the 80+ mark, I wanted to step back, summarize and review this entire situation.

It all started when someone, who by his own admission is not a native speaker of Bangla/Bengali language, wanted to transcribe Sanskrit Shlokas (hymns) in the Bangla script into a digital format and requested for modifications in a in-use keymap. To what final end, is however unclear. This is not an unusual practice as there are numerous books and texts of Sanskrit that have been written in the Bengali script and this effort can be assumed as a natural progression to digitizing texts of this nature. What stands out is the unusual demand for the addition of a certain character, which is not part of the Bengali script, into a Bengali keymap (much in use) that this gentleman wanted to use to transcribe them. The situation worsens with more complications because this character is not a random one and belongs to the Assamese script.

The character in question is the Assamese character RA, written as ৰ and has the Unicode point U+09F0. This is part of the Unicode chart for the Bengali script, which is used to write Bengali, Assamese, and Manipuri (although Meitei is now the primary script for Manipuri). Although exclusively used for Assamese, this character does have a historical connection with the Bengali script. ৰ was also used as the Bengali character RA before the modern form র (Unicode point U+09B0) came into practice. At which exact point of time this change happened is somewhat unclear to me, but references to both the forms can be found as early as 1778 when Nathaniel Brassey Halhed published the A Grammar of the Bengali Language. Dr.Fiona Ross‘ extensively researched The Printed Bengali Character: Its Evolution contains excerpts from texts where the ancient form of র i.e. ৰ has been used. However, this is not the main area of concern.

Given its pan-Indian nature, Sanskrit has been written in numerous regional scripts. I remember, while at school Sanskrit was a mandatory third language of study. The prescribed book for the syllabus used the Devanagari script. On the other hand, the Sanskrit books that I saw in my home were in the Bengali script (some of my ancestors, including my maternal Grandfather were Priests and Sanskrit teachers who had their own tol). Anyway, I digress here. The main concern is around the two characters of ‘BA‘ and ‘VA‘ . In Devanagari, ‘BA‘ i.e. and ‘VA‘ i.e. are two very distinct characters with distinct pronunciations. While ‘BA is used for words that need a pronunciation such as बालक (phonetic: baa-lak), ‘VA is used for words such as विद्या (phonetic:weedh-ya). In Bengali, these two variations are respectively known as ‘Borgiyo BA‘ and ‘Antastya BA‘. However, unlike Devanagari they do not have separate characters. So both of them are represented by (U+09AC in the Unicode chart). Earlier they held two different positions in the alphabet chart, but even that has been relinquished. The pronunciation varies as per the word, a practice not dissimilar to the behaviourial aspects of the letters, ‘C‘ and ‘T‘ in English.

This is where it starts getting muddled. The gentleman in question requests for a representation of the Devanagari equivalent of the separation of BA and VA, for Bengali as well. Reason stated was that the appropriate pronunciations of the Sanskrit words were not possible without this distinction. So as a “solution” he suggested the use of the Assamese RA glyph in place of the Borgiyo BA sounds and the Bengali BA to be reserverd exclusively for the lesser used Antastya BA i.e. VA sounds. Depicted below as a diagram for ease of reference.

On the basis of what legacy this link is to be established or how the pronunciation for the two characters have been determined, meets a dead end in the historical references of the Bengali script[1].

To support his claims he also produces a set of documents[1][2] which proudly announces itself as the “New Bengali character set” (নূতন বর্ণপরিচয়/Nutan Barnaparichay) at the top of the pages. The New Bengali character set seems quite clandestine and no record of it is present in the publications from the Paschimbanga Bangla Academy, Bangla Academy Dhaka or any of the other organisations that are considered as significant contributors for the development and regulation of the language. Along with the New character set, there are also scanned images from books where the use of this character variation can be seen. However the antecedents of these books have not been clearly identified. In one of them, the same word (বজ্র) has been spelt differently in two sentences, which imho adds more confusion to the melee.

On my part, I have also collected some excerpts from Sanksrit content written in Bengali, with particular emphasis on the use of ব. Among them is one from the almanacs (ponjika) which are widely popular amongst householders and priests in everyday reference of religious shlokas and hymns.

The character in the eye of the storm i.e. the Assamse RA and its Bengali counterpart are very special characters. These form two different conjuncts each with the ‘YA’ (U+09AF that is shared by both the scripts) without changing the sequence of the characters:

র + য = র্য
র + য = র‍্য (uses ZWJ)

ৰ + য = ৰ্য
ৰ + য = ৰ‍্য (uses ZWJ)

The Bengali character set as we know it today was created by Ishwar Chandra Bidyasagar, in the form of the বর্ণপরিচয়/Barnaparichay written by him. Since much earlier, the script also saw modern advancements mostly to cater to the requirements of the printing industry. His reforms added a finality to this. The বর্ণপরিচয়/Barnaparichay still remains as the first book that Bengali children read while learning the alphabets. This legacy is the bedrock of the printed character and, coupled with grammar rules, defines how Bengali is written and used since the last 160 years. The major reform that happened after his time was the removal of the character ঌ (U+098C) from everyday use. Other than this, the script has remain unchanged. In such a situation, a New Barnaparichay with no antecedents and endorsements from the governing organisations cannot shake the solid foundations of the language. The way the language is practised allows for some amount of liberty mostly in terms of spellings mainly due to the legacy and origins of the words. Some organisations or publication houses prefer to use the conservative spellings while others recommend reforms for ease of use. The inevitable inconsistencies cannot be avoided, but in most cases, the system of use is documented for the reader’s reference. Bengali as a language has seen a turbulent legacy. An entire nation was created from a revolution centered around the language.

During this entire fiasco the inputs from the Bengali speaking crowd (me included) were astutely questioned. Besides the outright violation of the Bengali script, complications arising out of non-standard internationalized implementations which were highlighted, were waived off. What is more disappointing is the way the representatives from IndLinux handled the situation. As one of the pioneering organisations in the field of Indic localization they have guided the rest of the Indic localization groups in later years. With suggestions for implementing the above requests in the Private Use Area of the fonts (which maybe a risky proposition if the final content, font and keymap are widely distributed) and providing customized keymaps they essentially risked undoing critical implementational aspects of the Bengali and Assamese internationalization. Whether or not the claims from the original requestor are validated and sorted, personally I am critically concerned about the advice that was meted out (and may have also been implemented) by refuting the judgement of the Bengali localization teams without adequate vetting.

Note:A similar situation was seen with the Devanagari implementation of Kashmiri. Like the Bengali Unicode chart, the Devanagari chart caters to multiple languages including Hindi, Marathi, Konkani, Maithili, Bodo, Kashmiri and a few others. Not all characters are used for all the languages. While implementing Kashmiri, a few of the essential characters were not present in the Devanagari chart. However, similar looking characters were present in the Gurumukhi chart and were used while writing Kashmiri. This was rectified through discussions with Unicode, and the appropriate code points were alloted in the Devanagari chart for exclusive use in Kashmiri.

Translation – dive in!

The reason I started writing this post is the recent rise in the interest towards things related to translation and localization. Everywhere one turns to there is someone evangelising this revolution from atop a soapbox and gathering people around for quick win localization projects. It may be reasonable to question if I consider this innundation of localizers as an unhappy turn of events. Hardly. After having toiled alone for ages, at times through uncharitable sneers it is indeed a welcome change. However, I have some grave reservations about how this is being done.

Off-late there has been a rising impetus on forming geography based communities around some of the significant (eye-ball grabbing) FOSS projects. With the proliferation of the projects’ user base this is a natural progression in the scheme of things. When communities are based on geographies one of the first things they tend to find commonality in is their language. Thus, enter localization. So far so good. However, this is where the slightly disruptive butterfly starts to flutter its wings.

The localization projects are also a major entry point for new contributors to be lured into the projects. It has forever been a perception that translation was the easiest way to start contributing to any open-source project. And why not? Everyone seemed to be able to read and comprehend English – the original language used in most components and the same ‘everyone’ also knew how to read and write the language that they were going to translate into. Fair enough, come join. All Hail Crowdsourcing!!

This is where the fluttering starts to get serious. Most of these localization projects were not new discoveries. Depending upon the maturity of their localization sub-projects, there are established norms of translation, review, terminology and validation, including certain methods to groom new translators. Teams are formed around a language to ensure that translations are consistently updated and polished to attain a high degree of consistency and perfection. Conventions evolve and rules honoured.

Does that make it difficult for new entrants to join? Marginally, yes. But then which other projects do not have this barrier. If it is acceptable for projects to validate and audit codes before accepting them, why should localized content be considered an open field for experiements. Especially, when compared to codes the latter is far more difficult to trace and rectify.

The following is an excerpt from an interview with Sue Gardner, Director of the Wikimedia Foundation, where she answers a query about whether new contributors were finding it difficult to work their way around the policies:

We queried her take on this second area, pointing out that all publishers that aim to present high-quality information find they need complex rules, whether explicit or via accepted standards of writing and scholarship. Could she give specific examples of areas where we could simplify policy without sacrificing standards?

Yes, the premise of this question is absolutely correct. The analogy I often use is the newsroom. Anybody who’s curious and reasonably intelligent can be a good journalist, but you do need some orientation and guidance. Just like a newsroom couldn’t invite in 100 random people off the street and expect them to make an immediate high-quality contribution, neither can Wikipedia expect that.”

What most of these populist programs tend to miss are the percolations that are felt elsewhere. For languages with large amount of published localized content that have been filtered through long periods of (mostly) manual validation, experiments on ancilliary components introduce inconsistency and worse, errors. For instance, non-validated translations in add-on components ruin the user-interface of the main component. Which in most cases is an extremely prominent application and often part of enterprise level products. These errors can be resolved by the usual bug tracking systems, but how does one chase up volunteers who had turned up for localization sprints and have moved on?

Crowdsourcing is here to stay. So will crowdsourced contributions. With more flexibility in translation tools, the new age translators do not have to go through the rigourous grooming process that were prevalent until a few years back and has shaped a lot of the veteran translators.They can get their contributions into the main projects without any delay. Often with the blessings of the sponsoring project who do not have to wait for their translation assets to multiply and their local communities to expand. With some amount of experience both as a translator and as a homemaker, the one thing that I can vouch for is that technical translation is not unlike housework – everyone has an opinion oh how easy it is but you don’t know how many corners you end up cleaning until you are down on your knees doing it.