Its been a while since I managed to write something important on this blog. My blogging efforts these days are mostly concentrated around the day job, which by the way has been extremely enriching. I have had the opportunity to widen my perspective of global communities working with the multilingual digital world. People from diverse cultures come together in the Wikimedia projects to collaborate on social issues like education, digital freedom, tolerance of expression & lifestyles, health, equal opportunities and more. What better place to see this happen than at Wikimania – the annual conference of the Wikimedia movement. The 10th edition of the conference was held this year in London, UK. I was fortunate to participate and also present, along with my team-mate Kartik Mistry. This was our first presentation at a Wikimania.
Since the past few years, I have tried to publish the talking points from my presentations. This was my first major presentation in a long time. Kartik and I presented about the challenges we face everyday when testing applications, that our team creates and maintains for the 300+ languages in Wikimedia projects. We have been working actively to make our testing processes better equipped to handle these challenges, and to blend them into our development work-flow. The slides and the talking points are presented below. I will add the link to the video when its available. Feedback is most welcome.
As opposed to traditional testing methodologies an important challenge for testing internationalized applications is to verify the preciseness of the content delivered using them. When we talk about applications developed for different scripts and languages, the key functionalities like display, and input of content may require several levels of verification before it can be signed off as being adequately capable of handling a particular language. A substantial part of this verification process includes visual verification of the text, which requires extensive collaboration between the language speakers and the developers. For content on Wikimedia this can stretch to more than 300 languages for websites that are active or waiting in the incubator. In this session, we would like to present about our current best practices and solutions like tofu-detection – a way to identify if scripts are not being displayed, that can narrow down the long-drawn manual testing stages to address specific problems as they are identified. Talk Submission.
As we know, the mission of the Wikimedia Projects is to share the sum of all human knowledge. In the Wikimedia universe we have active projects in over 300 languages, while the multilingual resources have the capability to support more than 400 languages.
For using these languages we use extra tools and resources all the time (sometimes even without our knowledge). But these are not developed as widely as we would like them to be.
You may know them already…
Fonts, input methods, dictionaries, the different resources that are used for spell checking, grammar and everything else that is needed to address the special rules of a language. To make it work in the same way we can use English in most systems.
The applications that we develop to handle multilingual content are tested in the same way other applications are tested. The code sanity, the functionality and everything else that needs to be tested to validate the correctness of the design of the application is tested during the development process. (Kartik described this in some details.)
However, this is one part. The other part combines the language’s requirements to make sure that what gets delivered through the applications is what the language needs.
So the question we are trying to answer as developers is – my code works but does the content look good too?
At this point what becomes important is a visual verification of the content. Are the t-s being crossed and the i-s being dotted but in more complex ways.
Lets see some of the examples here to help explain better what we are trying to say:
There are always more interesting things that keep coming up. The takeaway from this is that, we haven’t yet found a way to escape manual tests when developing applications that are expected to handle multilingual content.
For now, what we have been trying to do is to make it less painful and more organised. Lets go over a checklist that we have been using as a guideline.
So that’s all we are currently doing to keep things organised. However, we would also like to explore options that can cut down this Herculean effort.
We had a blooper moment, when during the presentation we realised that the screenshot for Example 6 had been accidentally removed. We did not plan for it, but the audience got a glimpse of how manual tests can save the day on more serious occasions.
Over the years, many people have been syndicated on this planet and we have thoroughly enjoyed reading about their journey from being individuals enthusiastic about a project to growing as successful professionals, entrepreneurs, and mature contributors with ideas that define much of what the Indian FLOSS-world stands for today.
We wanted to celebrate this occasion with what PFI best stands for – blog posts. We would be super happy if anytime during the next few weeks you could write a post on your syndicated blog about being part of PFI all these years and how you would like to see the project evolve.
Do let us know if your blog has moved and we shall make the changes to get you back on the planet. Thanks for your support and do keep writing.
(Writing on behalf of the PFI team)
PFI is open for syndication all-year-round. So let us know if you would like your blog to be added to the planet.
The findings of the recently conducted survey of the languages of India under the aegis of the People’s Linguistic Survey of India have been the talking point since the past few days. The survey results are yet to be published in its entirety, but parts of it has been released through the mainstream media. The numbers about the scheduled and non-scheduled languages, scripts, speakers are fascinating.
Besides the statistics from the census, this independent survey has identified languages which are spoken in remote corners of the country and by as less as 4 people. From some of the reports that have been published, what one can gather is that there are ~780 languages and ~66 scripts presently in use in India. Of which the North Eastern states of India have the largest per capita density of languages and contribute with more than 100 (closer to ~200 if one sums things up) of those. It has also been known that in the last 50 years, ~250 languages have been lost, which I am assuming means that no more speakers of these languages remain.
This and some other things have led onto a few conversations around the elements of language diversity that creep into the everyday Indian life. Things that we assume for normal, yet are so diametrically varied from monolingual cultures. To demonstrate, we picked names of acquaintances/friends/co-workers and put 2 or more of them together to find what was a common language for each group. In quite a few cases we had to settle that English was the only language a group of randomly picked people could converse in. Well if one has been born in (mostly) urban India anytime onwards from the 1970s (or maybe even earlier), this wouldn’t be much of a surprise. The bigger cities have various degrees of cosmopolitan pockets. From a young age people are dragged through these either as part of their own social circle (like school) or their parents’. Depending upon the location and social circumstances English is often the first choice.
When at age 10 I had to change schools for the very first time, I came home open-mouthed and narrated to my mother that in the new school the children speak to each other in Bengali! Until that time, Bengali was the exotic language that was only spoken at home and was heard very infrequently on the telly on sunday afternoons. The conservative convent school where I went was a melting pot of cultures with students from local North East Indian tribes, Nepalis (both from India and Nepal), Tibetans, Chinese, Bhutanese and Indians from all possible regions where Government and Armed Forces personnel are recruited from. Even the kid next door who went to the same school spoke in English with me at school and in Bengali at the playground in the evening.
The alternative would be the pidgin that people have to practice out of necessity. Like me and the vegetable vendor in the sunday market. I don’t know her language fluent enough to speak (especially due to the variation in dialect), she probably hasn’t even heard of mine, and we both speak laughable hindi. What we use is part Hindi and part Marathi and a lot of hand movements to transact business. I do not know what I would do if I was living further south were Hindi is spoken much less. But it would be fun to try out how that works.
An insanely popular comic strip has been running since the past year – Guddu ang Gang, by Garbage Bin Studios. The stories are a throwback to our growing up years from the late 80s and 90s and touched so many chords on a personal level. The conversations are in Hindi, but the script they use is English. Like so many other thousands of people I have been following it and even purchased the book that came out. But maybe it wouldn’t have been the same amount of fun if the script was in Devanagari. I don’t read it fast enough. And no, in this case translating the text won’t make any sense. There is Chacha Chaudhary for that. Or even Tintin comics. Thanks to Anandamela, most people my age have grown up reading Tintin and Aranyadeb (The Phantom) comics in Bengali. There also exist juicy versions of Captain Haddock’s abuses.
Last year I gave a talk at Akademy touching on some of these aspects of living in a multi-cultural environment. TL&DR version: the necessities that requires people to embrace so many languages – either for sheer existence or for the fringes, and how we can build optimized software and technical content. For me, its still an area of curiosity and learning. Especially the balance between practical needs and cultural preservation.
** Note about the title: bolta – Hindi:’saying’, Bengali:’wasp’. Go figure!
Among the various conversations that happened over the last week, one that caught my attention was about having a standardized list of colour for translation glossaries. This has been on my mind for a long time and I have often broached the subject in various talks. Besides the obvious primary hurdle of figuring out how best to create this list, what I consider of more importance is the way such a reference set ought to be presented. For word based terminology, a standard mapping like:
key terms -> translated key terms (with context information, strongly recommended)
is easy to adopt.
However, for colours this becomes difficult to execute, for one very important reason. Colours have names which have been made to sound interesting with cultural or local references, nature (again maybe local or widespread) popular themes or general creativity. This makes it hard to translate. To translate colour names like ‘Salmon’ or ‘Bordeaux’ or <colour-of-sea-caused-by-local-mineral-in-the-water-no-one-outside-an-island-in-the-pacific-has-heard-of> one has to be able to understand what they refer to, which may be hard if one has never come across the fish or the wine or the water. To work around that, I have been using a 2-step method for a while which is probably how everyone does anyways (but never really talks about):
colour-name -> check actual colour -> create new name (unless of course its some basic colours like Red)
So, a natural progression in the direction of standardizing this would involve having the actual colour squeezed in somewhere as the context. Something on the lines of:
Colour Name -> Sample of the of colour -> Colour Translation
It would be good to have something like this set up for not just translation, but general reference.
Last saturday i.e. 29th December 2012 we had a translation sprint for Ankur India with specific focus on Gaia localization. The last few weeks saw some volunteers introducing themselves to participate in translation and localization. The Firefox OS seemed like a popular project with them that was also easy to translate. However, the reigning confusion with the tool of choice, was not easy to workaround. The new translators were given links to the files they could translate, and send over to the mailing list/mentor for review. Going back and forth in the review process was taking time and we quickly decided on the mailing list, a date to have a translation sprint. We used the IRC channel #ankur.org.in and gathered there from 11 in the morning to 4 in the evening. The initial hour was spent to set up the repository and to decide how we were going to manage the tasks between ourselves. Two of us had commit rights on the Mozilla mercurial repository. Of the 5 translators, two participants were very new to translation work, so it was essential to help them with constant reviews. By the end of the second hour, we were string crunching fast and hard, translators were announcing which modules they were picking (after some initial overlooking of this, prolly due to all the excitement) and then pushing them into the mercurial repository. We shut shop at the closing time, but had a clear process in place which allowed people to continue their work and continue the communication over email. All it needed was an IRC channel and a fundamental understanding of the content translation and delivery cycle.
What could have been nicer:
During the recently held Language Summit at Pune, we got an opportunity to discuss about a long standing issue related the localization process. Several discussions over various media have been constantly happening since the past couple of years and yet a clarity on the dynamics were sorely missed. A few months back a generic bug was also filed, which helped collate the points of these dispersed discussion.
Last week, we had Arky from Mozilla with us who helped us get an insight on how things currently stand in the Mozilla Localization front. Old hats like me who have been working on the localization of Mozilla products since a long time (for instance, I had started sometime around Firefox 1.x), had been initiated and trained to use the elaborate method of translation submissions using file comparision in the version control system. During each Firefox release, besides the core component there are also ancilliary components like web-parts that need to be translated on other Version control systems or through bugs. Thankfully there is now the shipping dashboard that lists some of these bits at one url.
However, recently there have been quite a few announcements from various quarters about Mozilla products being made available for translation through several hosts/tools – verbatim, narro, locamotion, even on transifex. Translators could gather files and submit translations via these tools, yet none of them deprecated the earlier method of direct submission into the servers through the Version Control Systems. The matter was much compounded with also a spate of translations coming in from new translators who were being familiarized with translation work at various local camps as part of Mozilla’s community outreach programs.
During the above mentioned session, we sought to find some clarity on this matter and also to understand the future plans that are being undertaken to reconcile the situation. Firstly, we created a list of all the tools and translation processes that are presently active.
1. Direct Submission into Mercurial or SVN
2. https://localize.mozilla.org/ – Aka ‘Verbatim‘ is essentially a version of pootle running on a server hosted by Mozilla. Used to translate web-parts, snippets, SUMO content etc.
6. Transifex – Primarily gaia project
7. Babelzilla – Mozilla Plug-ins
There could be more beyond the above.
The reason that was given for the existence of all these tools is to allow translators to choose a tool that they were ‘comfortable with‘. This however gives rise to quite a few complications involving syncing between these tools which evidently provide duplicate platforms for some of the projects and also about maintaining a trace of translations by the translation coordinators. Especially when the direct submission into VCS is still pretty much an open option for translators (coordinators) who may have not be aware of a parallel translation group working on the same project on another translation platform.
A new project called ‘ELMO‘ is aimed at rectifying this situation. This would host the top-level URI of the Mozilla Localization project, with direct links to each Language’s home page. The home page intends to list the Translation team details and urls for the projects. However, there is one big difference that seemed apparent: Unlike other Translation Projects which provide one umbrella translation team, each of the Mozilla products can have different Translation Teams and Coordinators, independent of each other. It may be a scaleable solution for manpower management, but leaves a big chance of product continuity going off-sync in terminology and translation. However, it may be a good idea to wait and see how Elmo fixes the situation.
Meanwhile there were a few action items that were fixed during the discussion (Thanks Arky!), these were:
1. A page on the mozilla wiki listing *all* the translation tools/hosts that are active and the projects that they host
2. Follow up on the discussion bug for “Process Modification”
3. A request to have automatic merging of strings modified in source content into the l10n modules in Mercurial (i.e. the strings identified via compare-locale). For instance, the comparision between the bn-IN and en-US module for the Aurora branch can be found here (cron output).
4. Explore the possibility to identify a consolidated Project calendar for all the Mozilla l10n projects. (Reference comment here)
As Arky mentioned during the discussion, there were plans that were already underway to implemention and I am quite excited to wait and see how things go. Some blogs or updates from the Mozilla L10n administration team would be really helpful and I hope those come in quick and fast.
…(Please leave a comment if I missed your name)