Tag Archives: l10n

Testing Multilingual Applications – Talk Summary from Wikimania 2014

It’s been a while since I managed to write something substantial on this blog. My blogging efforts these days are mostly concentrated around the day job, which by the way has been extremely enriching. I have had the opportunity to widen my perspective on the global communities working in the multilingual digital world. People from diverse cultures come together in the Wikimedia projects to collaborate on social issues like education, digital freedom, tolerance of expression & lifestyles, health, equal opportunities and more. What better place to see this happen than at Wikimania – the annual conference of the Wikimedia movement. The 10th edition of the conference was held this year in London, UK. I was fortunate to participate and also present, along with my team-mate Kartik Mistry. This was our first presentation at a Wikimania.

For the past few years, I have tried to publish the talking points from my presentations. This was my first major presentation in a long time. Kartik and I presented on the challenges we face every day when testing the applications that our team creates and maintains for the 300+ languages in the Wikimedia projects. We have been working actively to make our testing processes better equipped to handle these challenges, and to blend them into our development workflow. The slides and the talking points are presented below. I will add the link to the video when it’s available. Feedback is most welcome.

Talk Abstract

As opposed to traditional testing methodologies, an important challenge in testing internationalized applications is to verify the preciseness of the content delivered through them. When we talk about applications developed for different scripts and languages, key functionalities like the display and input of content may require several levels of verification before an application can be signed off as adequately capable of handling a particular language. A substantial part of this verification process is visual verification of the text, which requires extensive collaboration between the language speakers and the developers. For content on Wikimedia this can stretch to more than 300 languages for websites that are active or waiting in the incubator. In this session, we would like to present our current best practices and solutions like tofu detection – a way to identify when scripts are not being displayed – that can narrow down the long-drawn manual testing stages to address specific problems as they are identified. Talk Submission.

Slides

Talk Summary

Slide 2

As we know, the mission of the Wikimedia Projects is to share the sum of all human knowledge. In the Wikimedia universe we have active projects in over 300 languages, while the multilingual resources have the capability to support more than 400 languages.

To use these languages we rely on extra tools and resources all the time (sometimes without even knowing it). But these are not as widely developed as we would like them to be.

You may know them already…

Slide 3

Fonts, input methods, dictionaries, the resources used for spell checking and grammar, and everything else needed to address the special rules of a language – so that we can use it in the same way we can use English on most systems.

Slide 4

The applications that we develop to handle multilingual content are tested in the same way as other applications. Code sanity, functionality and everything else that needs to be tested to validate the correctness of the application’s design are covered during the development process. (Kartik described this in some detail.)

Slide 5

However, this is only one part. The other part brings in the language’s requirements to make sure that what gets delivered through the applications is what the language needs.

So the question we are trying to answer as developers is – my code works but does the content look good too?

Slide 6

At this point what becomes important is a visual verification of the content. Are the t’s being crossed and the i’s being dotted – only in much more complex ways?

Let’s see some examples to help explain what we are trying to say:

  • Example 1 and Example 2: Fonts entirely missing – the text displays as tofu (blank boxes). A sketch of a simple glyph-coverage check follows this list.
  • Example 3: Partially available text – makes it hard to understand what the user interface wants you to do
  • Example 4: Input methods in the Visual Editor don’t capture the sequence of typed characters
  • Example 5: The otherwise functional braces lose their position when used with RTL text
  • Example 6: Dependent vowels in complex scripts appear broken with a particular font
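Tofu detection, at its simplest, is a glyph-coverage check. Purely as an illustration – this is not the actual tooling we use, and the font path and sample string below are made up – here is a minimal Python sketch using the fontTools library that reports which characters of a string have no glyph in a given font:

    # Illustrative glyph-coverage check: does this font have a glyph for every
    # character in the text, or will some of them display as tofu?
    # The font path and sample string are made up for the example.
    from fontTools.ttLib import TTFont

    def missing_characters(font_path, text):
        font = TTFont(font_path)
        cmap = font["cmap"].getBestCmap()  # Unicode codepoint -> glyph name
        return [ch for ch in text if not ch.isspace() and ord(ch) not in cmap]

    if __name__ == "__main__":
        missing = missing_characters("NotoSansBengali-Regular.ttf", "বাংলা লিপি")
        if missing:
            print("Likely tofu for:", ", ".join(hex(ord(ch)) for ch in missing))
        else:
            print("Every character has a glyph in this font.")

A check like this only tells you that a glyph exists; whether the glyph actually looks right still needs the visual verification described above.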

Slide 13

There are always more interesting things that keep coming up. The takeaway is that we haven’t yet found a way to escape manual tests when developing applications that are expected to handle multilingual content.

Slide 14

For now, what we have been trying to do is make it less painful and more organised. Let’s go over a checklist that we have been using as a guideline.

  1. Standard tests – These are the tests that developers are doing all the time: unit tests and so on. They are part of the development plan.
  2. Identify must-check items – Once you are through with the standard tests, try to identify the issues and checks that are most important for individual languages or for groups of similar languages. For instance, in languages with complex scripts you may want to check some combinations that should never break. (A minimal sketch of how such items can be replayed as tests follows this list.)
  3. Note the new and recurring bugs – This list should by no means be rigid. If during tests there are problems that seem to recur, or new bugs of major impact surface, add them to your set of must-checks so that you know they need to be tested again when you make the next release.
  4. Predictable regression tests – The idea is to keep the regression tests organised to some extent so that you don’t miss the really important things.
  5. Ad-hoc testing – However, by no means should the hunt for hidden bugs be stopped. Explore as far as you can. You may have to be a little careful, though, because you might find a really ugly bug and not remember how you ended up there. Retracing your steps can be a challenge, but that shouldn’t be a major blocker – once you find the bug, note it down.
  6. Track the results – For precisely this purpose we keep the tests that we regularly want to run in a test tracking system. We use TestLink, where you can organise the tests, the steps that a user should follow and the expected results. Successes and failures can be noted and tests can be repeated very easily across releases.
  7. Seek expert help – The two most important things to keep in mind are to speak to native speakers of the language, and perhaps to an expert if you are already a native speaker. There may be situations where your understanding of a language will be challenged. For instance, ancient scripts may have to be tested for projects like Wikisource, and they may be unfamiliar even to regular users of the modern version of the script.
  8. Testing environments – Secondly, make sure you have stable testing environments in place where people can come and test the applications in their own time.
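As an illustration of points 2 and 4 above – a hypothetical sketch rather than our actual test suite – must-check items can be kept as plain data and replayed as parametrized tests on every release; the render_text() helper here is a made-up stand-in for whatever layer is being verified:

    # Hypothetical sketch: must-check strings kept as data and replayed as
    # parametrized tests on every release. render_text() is a placeholder for
    # the real rendering/translation pipeline being verified.
    import pytest

    MUST_CHECK = [
        # (language code, input text, substring that must survive processing)
        ("bn", "কোমল", "ক"),
        ("ar", "مرحبا", "م"),
        ("ml", "മലയാളം", "മ"),
    ]

    def render_text(lang, text):
        # Placeholder: call the real pipeline here.
        return text

    @pytest.mark.parametrize("lang,text,expected", MUST_CHECK)
    def test_must_check_strings(lang, text, expected):
        rendered = render_text(lang, text)
        assert expected in rendered, f"{lang}: {expected!r} missing from output"
        assert "\ufffd" not in rendered, f"{lang}: replacement character found"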

So that’s all we are currently doing to keep things organised. However, we would also like to explore options that can cut down this Herculean effort.

Contact Us


We had a blooper moment when, during the presentation, we realised that the screenshot for Example 6 had been accidentally removed. We did not plan for it, but the audience got a glimpse of how manual tests can save the day on more serious occasions.

Hum bolta ko bolta bolta hain…

The findings of the recently conducted survey of the languages of India, carried out under the aegis of the People’s Linguistic Survey of India, have been the talking point for the past few days. The survey results are yet to be published in their entirety, but parts have been released through the mainstream media. The numbers on scheduled and non-scheduled languages, scripts and speakers are fascinating.

Besides the statistics from the census, this independent survey has identified languages which are spoken in remote corners of the country, some by as few as 4 people. From some of the reports[1][2] that have been published, what one can gather is that there are ~780 languages and ~66 scripts presently in use in India. Of these, the North Eastern states of India have the largest per capita density of languages and contribute more than 100 (closer to ~200 if one sums things up) of them. It has also been reported that in the last 50 years ~250 languages have been lost, which I am assuming means that no more speakers of these languages remain.

This and some other things have led to a few conversations around the elements of language diversity that creep into everyday Indian life. Things that we assume to be normal, yet are so diametrically different from monolingual cultures. To demonstrate, we picked names of acquaintances/friends/co-workers and put 2 or more of them together to find out what the common language for each group was. In quite a few cases we had to settle on English as the only language a group of randomly picked people could converse in. Well, if one was born in (mostly) urban India anytime from the 1970s onwards (or maybe even earlier), this wouldn’t be much of a surprise. The bigger cities have various degrees of cosmopolitan pockets. From a young age people are dragged through these either as part of their own social circle (like school) or their parents’. Depending upon the location and social circumstances, English is often the first choice.

When at age 10 I had to change schools for the very first time, I came home open-mouthed and narrated to my mother that in the new school the children speak to each other in Bengali! Until that time, Bengali was the exotic language that was only spoken at home and was heard very infrequently on the telly on Sunday afternoons. The conservative convent school where I went was a melting pot of cultures, with students from local North East Indian tribes, Nepalis (both from India and Nepal), Tibetans, Chinese, Bhutanese and Indians from all the regions that Government and Armed Forces personnel are recruited from. Even the kid next door who went to the same school spoke in English with me at school and in Bengali at the playground in the evening.

The alternative would be the pidgin that people have to practise out of necessity. Like me and the vegetable vendor in the Sunday market. I don’t know her language well enough to speak it (especially due to the variation in dialect), she probably hasn’t even heard of mine, and we both speak laughable Hindi. What we use is part Hindi and part Marathi and a lot of hand movements to transact business. I do not know what I would do if I were living further south, where Hindi is spoken much less. But it would be fun to find out how that works.

An insanely popular comic strip has been running for the past year – Guddu and Gang, by Garbage Bin Studios. The stories are a throwback to our growing-up years in the late 80s and 90s and touch so many chords on a personal level. The conversations are in Hindi, but the script they use is English. Like so many thousands of other people I have been following it and even purchased the book that came out. But maybe it wouldn’t have been the same amount of fun if the script had been Devanagari. I don’t read it fast enough. And no, in this case translating the text won’t make any sense. There is Chacha Chaudhary for that. Or even Tintin comics. Thanks to Anandamela, most people my age have grown up reading Tintin and Aranyadeb (The Phantom) comics in Bengali. There also exist juicy versions of Captain Haddock’s abuses.

Last year I gave a talk at Akademy touching on some of these aspects of living in a multi-cultural environment. TL;DR version: the necessities that require people to embrace so many languages – either for sheer existence or at the fringes – and how we can build optimized software and technical content around them. For me, it’s still an area of curiosity and learning. Especially the balance between practical needs and cultural preservation.

** Note about the title: bolta – Hindi:’saying’, Bengali:’wasp’. Go figure!

Building a Standardized Colour Reference Set

Among the various conversations that happened over the last week, one that caught my attention was about having a standardized list of colours for translation glossaries. This has been on my mind for a long time and I have often broached the subject in various talks. Besides the obvious primary hurdle of figuring out how best to create this list, what I consider more important is the way such a reference set ought to be presented. For word-based terminology, a standard mapping like:

key terms -> translated key terms (with context information, strongly recommended)

is easy to adopt.

However, for colours this becomes difficult to execute, for one very important reason. Colours have names that have been made to sound interesting with cultural or local references, nature (again, maybe local or widespread), popular themes or general creativity. This makes them hard to translate. To translate colour names like ‘Salmon’ or ‘Bordeaux’ or <colour-of-sea-caused-by-local-mineral-in-the-water-no-one-outside-an-island-in-the-pacific-has-heard-of>, one has to be able to understand what they refer to, which may be hard if one has never come across the fish or the wine or the water. To work around that, I have been using a 2-step method for a while, which is probably what everyone does anyway (but never really talks about):

colour-name -> check actual colour -> create new name (unless of course it’s a basic colour like Red)

So, a natural progression in the direction of standardizing this would involve having the actual colour squeezed in somewhere as the context. Something along the lines of:

Colour Name -> Sample of the colour -> Colour Translation

like:

Salmon -> [salmon colour swatch] -> ইঁট

It would be good to have something like this set up for not just translation, but general reference.
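To make the idea concrete, here is a hypothetical sketch in Python of what such a reference entry could look like, with the colour sample carried along as a hex value for context; the entries, hex values and Bengali renderings are only examples:

    # Hypothetical sketch of a colour glossary entry: the colour sample travels
    # with the name so the translator can see what they are naming.
    from dataclasses import dataclass

    @dataclass
    class ColourEntry:
        source_name: str   # e.g. "Salmon"
        sample_hex: str    # the actual colour, kept as context
        translation: str   # the rendering in the target language
        note: str = ""     # optional context ("named after the fish", etc.)

    GLOSSARY = [
        ColourEntry("Salmon", "#FA8072", "ইঁট", "named after the fish; a brick-like shade"),
        ColourEntry("Bordeaux", "#5F021F", "", "named after the wine; not yet translated"),
    ]

    for entry in GLOSSARY:
        print(f"{entry.source_name} ({entry.sample_hex}) -> {entry.translation or '??'}")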

Akademy 2012 Talk Transcript – Localizing software in Multi-cultural environments

The following is my talk transcript from Akademy 2012. During the talk I strayed quite a bit from this script, but in the end I managed to cover most of the major points I wanted to make. Either way, this is the complete idea that I wanted to present, and the discussions can continue elsewhere.


Good morning. The topic for my talk this morning is Localizing software in Multi-cultural environments. Before I start, I’d like to quickly introduce myself. For most of my talks, I include a slide at the very end with my contact details. But after the intense interactive sessions I forget to mention them. I did not want to make that mistake this time. My name is Runa and I am from India. This is the first time I am here in Estonia and at Akademy. For most of my professional life, I have been working on various things related to localization of software and technical documentation. This includes translation, testing, internationalization, standardization, work on various tools, and at times outreach about the importance of localization and why we need to start caring about the way it is done. That is precisely why I am here today: to talk about how localization can have hidden challenges and why it is important that we share knowledge and experience on how we can solve them.

Before I start the core part of the discussion today, I wanted to touch upon why localization of software is becoming far more important now. (I was listening to most of the talks in this room yesterday, and it was interesting to note that a continuous theme that reappeared in most of them was finding ways to simplify adjusting to a world of growing devices and information.) These days there is a much larger dependence on our devices for communication, basic commercial needs, travel etc. These could be our own devices or the ones in public spaces. It is often assumed to be an urban requirement, but with improvements in communication technology this is not really the case. Similarly, the other assumption is that the younger generation is more accustomed to the use of these devices, but again that is changing – out of compulsion or choice.

The other day I was watching a series on the BBC about the London Underground. There was a segment about how some older drivers who had been around for more than 40 years opted out of service and retired when new trains were introduced, because they did not feel at ease with the new system. Now I am not familiar with the consoles and cranks in railway engines, but for the devices and interfaces that we deal with, localization is, among other things, one major aspect we can use to help make our interfaces easy. We owe it to the progress that we are imposing on our lives.

The reason I chose to bring this talk to this conference was primarily the fact that it was being held here, in Europe. In terms of linguistic and cultural diversity, India by itself perhaps has as much complexity as the entire continent of Europe put together. However, individual countries and cultural groups in Europe depict a very utopian localization scenario, which may or may not be entirely correct. I bring this utopian perspective here as a quest, which I am hoping will be answered during this session through our interactions. I’ll proceed now to describe the multi-cultural environment that I and most of my colleagues in India work in.

Multi-cultural structure:

Firstly, I’d like to say here that I do not use the term multi-cultural in any anthropological sense. Instead it is a geopolitical perspective. Present-day India is divided into 28 states and 7 union territories, and the primary basis for this division is … well, ‘languages’. I’d like to show you a very simple pictorial representation of how it essentially is at the ground level.

The big circle here is our country and the smaller ones are the states. Each of the states has a predominant population of native speakers of its language. Some states may even have multiple state languages with equally well distributed populations. At this point, I’d like to mention that India has 22 languages recognised by the Indian constitution for official purposes, with Hindi and English considered the primary official languages – the latter being a legacy of the British era. The individual states have the freedom to choose the additional language or languages that they’d like to use for official purposes, and most states do have a 3rd and sometimes a 4th official language. So the chances are that if you land up at a place where Hindi is not the primary language of communication, you’d see the public signs written in a minimum of 3 languages. Going back to our picture, I have marked the people in each of these states in a distinctive colour. They have their own languages and their own regional cultures.

However, that is not really the status quo. We have people moving away from their home states to other states. Why? Well, first for employment in both government and private sector jobs. Education. Business. Defence postings and various other common enough reasons. And given it’s a country we are talking about, people have complete freedom to move about without the additional complications of visas or residence permits. So in reality the picture is somewhat like this.

The other multi-cultural grouping is when languages cross geographical borders. Mostly due to colonial legacy or new world political divisions and migration, some languages exist in various places across the world.

Like Spanish or French and, closer to home for me, my mother tongue Bengali, which is spoken in both India and Bangladesh. In these cases, the languages in use often take on regional flavours and create their own distinctive identities, to be recognised as independent dialects – like Brazilian Portuguese. Sometimes they do stay true to their original form to a large extent – as practised by the Punjabi- or Tamil-speaking diaspora.

While discussing the localization scenario, I’ll be focusing on the first kind of multi-cultural environment i.e. multiple languages bound together by geographical factors so that they are forced to provide some symmetry in their localized systems.

Besides the obvious complexities with the diversity, how exactly does this complicate matters on the software localization front? To fully understand that, we would need to first list the kind of interfaces that we are dealing with here.

In public spaces we have things like ATMs, bank kiosks, railway and airline enquiry kiosks and ticketing machines, while on the individual front we have various applications on desktop computers, tablets, mobile phones, handheld devices, GPS systems etc. If you were here during Sebas’ talk yesterday afternoon, the opening slide had the line ‘devices are the new ocean’. Anyway, some of these applications are for personal use, while others may be shared, for instance in the workplace or in educational institutes. In each of these domains, when we encounter a one-to-many diversity ratio, the first cookie that crumbles is standardization.

Language is one of the most fundamental personal habits that people grow up with. A close competitor is a home cooked meal. Both are equally personal and people do not for a moment consider the fact that there could be anything wrong with the way they have learnt it.

Going back to the standardization part, two sayings very easily summarize the situation here, one in Hindi and the other in Bengali:

1. do do kos mein zuban badal jaati hain i.e. the dialect in this land changes every 2 kos (a few miles)

2. ek desher buli onyo desher gaali i.e. harmless words in one language may be offensive when used in another language

So there is no one best way of translating that would work well for all the languages. The immediate question that would come to mind is, why is there a need to find a one size fits all solution?

These are independent languages after all. And while things do work well independently to a large extent, there are situations where effective standardization is much in demand.

For instance, in domains which span the diversity – like national defence, census records, law enforcement and police records, national identity documents etc.
Complications arise not just in identifying the ideal terminology, but in finding terminology that can quickly adapt to change. The major obstacle comes from the fact that a good portion of these technological advancements were introduced well before Indian languages were made internationalization-ready. As a result, in a lot of cases users have become familiar with the original English terminology. There were also people who knew the terms indirectly, perhaps someone like a clerk in the office who did not handle a computer but regularly needed to collect printouts from the office printer. So when localized interfaces started to surface, besides the initial mirth, they caused quite a bit of hindrance in everyday work. Reasons ranged from non-recognition of terminology to sub-standard translations. So we often get asked: when English is an official language and does cut across all the internal boundaries, why do you need to localize at all? It is a justified query, especially in a place like India, which inherited English from centuries of British rule. However, familiarity with a language is not synonymous with comfort. A good number of people in the work force or in the final user group have not learnt English as their primary language of communication. What they need is an interface that they can read faster and understand quickly to get their work done. In some cases, transliterated content has been known to work better than translated content.

The other critical factor comes from an inherited legacy. Before independence, India was dotted with princely states, kingdoms and autonomous regions. They often had their own currency and measurement systems, which attained regional recognition and made their way into the language of that region. A small example here.

In Bengali, the word used to denote currency is Taka. So although the Indian currency is the Rupee, when it is denoted in Bengali the word Rupee is completely bypassed and Taka is used instead. So 1 Rupee is called ek taka in Bengali. When we say Taka in Bengali in India, we mean the Indian Rupee (symbol ₹). But as an obvious choice, Taka (symbol ৳) has been adopted as the name for the currency of Bangladesh. So if a localized application related to finance or banking refers to the currency as Taka, a second level of check needs to be done to understand which country’s currency is being talked about before the calculations are done. To address issues of this nature, we often have translations for the same language segregated into geographical groups, mostly based on countries.
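This is also why localized financial applications usually key their formatting on a full locale (language plus region) rather than on the language alone. A small illustrative sketch using the Babel library – not tied to any particular application, and the amount is arbitrary:

    # Same language (Bengali), different regions, different currencies: the full
    # locale, not the language alone, decides what the formatted amount means.
    from babel.numbers import format_currency

    amount = 1500
    print(format_currency(amount, "INR", locale="bn_IN"))  # Indian Rupee for Bengali (India)
    print(format_currency(amount, "BDT", locale="bn_BD"))  # Bangladeshi Taka for Bengali (Bangladesh)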

Mono-cultural structure:

This is where I start describing the utopian dream. I do not use the word mono-cultural as an absolute, but to imply a predominantly mono-cultural environment. As opposed to the earlier complexity, a lot of places here in Europe are bound by the homogeneity of a predominant language and culture. Due to economic stability and self-sufficiency, the language of the land has been the primary mode of communication, education and administration. There is no need for a foreign language to bind the people in various parts of the country. If you know your language, you can very well survive from childhood to old age. The introduction of new technology in localized versions completely bypassed any dependency on an initial uptake through English. Without the baggage of an inherited cross-cultural legacy, and bound by the commonality of a technology-integrated lifestyle, the terminology stabilized much faster for adoption. So if you knew how to use an ATM in one city, you would most likely be able to use another one just the same in another city. That’s probably the primary reason why various applications are translated much faster in these languages, which have a much higher user base. Regional differences aside, a globally acknowledged version of the language is available and not difficult to understand.

How do we deal with the problems that we face in multi-cultural places:

The first thing would probably be to accept defeat on a homogeneous terminology. It would be impractical. But that doesn’t stop one from finding suitable workarounds and tools to deal with these complexities.

1. Collaboration on translations
2. Tools that facilitate collaboration
3. Simplify the source content
4. Tools for dynamic translation functionalities
5. Learn from case studies
6. Standardize on some fronts

Collaboration on translations – When translating, if you come across a term or phrase that you personally struggled to translate or think may pose a problem for other translators, it is reasonable to leave messages on how to interpret it. Often highly technical terms, or terms from a different culture or location, are unknown or hard to relate to, and instead of every translator searching for the term individually, a comment from another translator serves as a ready reckoner. The information that can be passed in this way includes a description of the term or phrase, and how another language has translated it, so that other translators of the same language or of a closely related language can quickly identify how to translate it.

Tools that facilitate collaboration – To collaborate in this manner, translators often do not have any specific tools or formats in which to leave their comments. So when using open source tools, translators generally have to leave these messages as free-form ‘comments’, which may or may not be noticed by the next translator. Instead, it is beneficial if the translation tools allow for cross-referencing across other languages as a specialized feature. I believe the proprietary translation tools do possess such features for collaboration.
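For what it’s worth, the common gettext workflow does let a translator attach free-form comments to an entry, even if nothing guarantees the next translator will notice them. A hypothetical sketch using the polib library – the file path and strings are made up:

    # Hypothetical sketch: attaching a translator comment to a PO entry so that
    # the next translator (or a closely related language team) gets some context.
    import polib

    po = polib.pofile("bn/LC_MESSAGES/example.po")  # made-up path
    for entry in po:
        if entry.msgid == "Salmon":
            # tcomment holds free-form translator comments in the PO file.
            entry.tcomment = (
                "Colour named after a fish; rendered as a brick-like shade "
                "in Bengali. See the shared colour glossary."
            )
    po.save()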

Simplify the source content – However, until such features are integrated or collaborative practices adopted, a quick win for easier translation is to get back to the source content creators for explanations or with requests to simplify their content. The original writers of user interface messages try to leave their creative stamp on the applications, which may include cleverly composed words, simplified words, new usages of existing words, local geographic references, colloquial slang, analogies from an unrelated field or even newly created terms which do not have parallel representations in other languages. Marta mentioned a similar thing yesterday during her talk – she said that humour in commit messages should ideally be well understood by whoever is reading them. If taken as a creative pursuit, translators have the liberty to come up with their own version of these creations. However, when we are looking at technical translations for quick deployment, the key factor is to make it functional. So while the translators can reach out to the content creators, the original content creators could also perhaps run a check before they write their content, to see if it will be easy to translate.

Tools for dynamic translation functionalities – Before coming here to Estonia, I had to read up on some documents related to visas etc. which were not available in English. The easiest way to get a translated version of the text from German was through an online translation platform. Due to the complexity of Indian languages, automatic translation tools for them have not yet evolved to the same levels of accuracy as we can see for European languages. But the availability of such tools would benefit societies like ours, where people do move around a lot. Going back to the earlier example of a ticket booking kiosk, let’s assume a person has had to move out of their home state and is not proficient in either of the two official languages or in the local language of the state they have moved to. In such a case, our users would benefit if the application on the kiosk had a feature so that interfaces for additional languages could be generated as required, either from existing translated content or dynamically. This is for interface display. The other part is to provide simplified writing aids like phonetic and transliteration keyboards for writing complex scripts quickly.

Standardize on some fronts – However, standardization is a key element that cannot be overlooked entirely. As a start, terminology related to the basic functional areas where content is shared across languages needs to be pre-defined, so that there is no chance of discrepancy and even auto-translation functions can be implemented quickly.

Learn from case studies – And of course nothing beats learning from existing scenarios of a similar nature. For instance, a study of how the Spanish and Italian translation teams collaborated while working on some translations may be applied somewhat effectively to languages with close similarities, like Hindi and Marathi.

Conclusion

Whether in a multi-cultural environment or otherwise, localization is here to stay. With the users of various applications growing every day, the need for customization and ease of use will grow with them. And like any other new technology, the importance lies in making users confident in using it. Nothing boosts confidence better than providing an interface that they can find their way around.

In Agustin’s keynote yesterday afternoon, he mentioned that there is a need for patience to instill confidence during these times of fast-moving technology. At a discussion some time back, someone suggested doing away with written content on the interface and retaining only icons. Realistically, written content can never be completely removed. But yes, it can be made easier to use. Sebas had shared a similar thought yesterday: that technology should be made functional for users’ needs and not just because it was fun to develop.

A few months back the Government of India sent out a circular to its administrative offices saying that in place of difficult Hindi words, Hinglish – a mix of English and Hindi – could be used to ease the uptake of the language. I wholeheartedly shared this view and had followed up with a blog post on this where I mentioned that:

Familiar terms should not be muddled up, and the readability of the terms is not compromised,

primarily to ensure that terminology is not lost in translation when common issues are discussed across geographies, especially in the global culture of the present day that cuts across places like multinational business houses and institutes of higher education.

Akademy 2012

Akademy – the annual summit of the KDE community – is happening in Tallinn, Estonia this year. It’ll be the first time I’ll be attending this conference. The schedule for this 7-day summit has talks, sessions and workshops, and what I am guessing will be a lot of exciting interactions. I’ll be presenting as well, and my talk is about ‘Localizing Software for Multicultural Environment‘. It’s on the 1st of July, and if you are a translator, write documents, develop software, use localized environments and are also attending Akademy, do please try to head to Room 1 that day. I am planning to run this session as a comparative study for the most part, with me presenting on localization in a multi-cultural environment and gathering perspectives from non-multi-cultural translation groups. The talk transcript will be available here on my blog right after the talk. However, if there are any questions that you’d like me to address during the talk, please do let me know over email or through the comments.

Thanks to the Akademy team for the invitation and sponsorship. Looks like these will be days very well spent.

Translation – dive in!

The reason I started writing this post is the recent rise in interest in things related to translation and localization. Everywhere one turns there is someone evangelising this revolution from atop a soapbox and gathering people around for quick-win localization projects. It may be reasonable to ask whether I consider this inundation of localizers an unhappy turn of events. Hardly. After having toiled alone for ages, at times through uncharitable sneers, it is indeed a welcome change. However, I have some grave reservations about how this is being done.

Of late there has been a rising impetus towards forming geography-based communities around some of the significant (eyeball-grabbing) FOSS projects. With the proliferation of the projects’ user base this is a natural progression in the scheme of things. When communities are based on geographies, one of the first things they tend to find commonality in is their language. Thus, enter localization. So far so good. However, this is where the slightly disruptive butterfly starts to flutter its wings.

The localization projects are also a major entry point for new contributors to be lured into the projects. It has forever been the perception that translation is the easiest way to start contributing to any open-source project. And why not? Everyone seemed to be able to read and comprehend English – the original language used in most components – and the same ‘everyone’ also knew how to read and write the language that they were going to translate into. Fair enough, come join. All Hail Crowdsourcing!!

This is where the fluttering starts to get serious. Most of these localization projects were not new discoveries. Depending upon the maturity of their localization sub-projects, there are established norms of translation, review, terminology and validation, including certain methods for grooming new translators. Teams are formed around a language to ensure that translations are consistently updated and polished to attain a high degree of consistency and perfection. Conventions evolve and rules are honoured.

Does that make it difficult for new entrants to join? Marginally, yes. But then, which other projects do not have this barrier? If it is acceptable for projects to validate and audit code before accepting it, why should localized content be considered an open field for experiments? Especially when, compared to code, the latter is far more difficult to trace and rectify.

The following is an excerpt from an interview with Sue Gardner, Executive Director of the Wikimedia Foundation, where she answers a query about whether new contributors were finding it difficult to work their way around the policies:

We queried her take on this second area, pointing out that all publishers that aim to present high-quality information find they need complex rules, whether explicit or via accepted standards of writing and scholarship. Could she give specific examples of areas where we could simplify policy without sacrificing standards?

"Yes, the premise of this question is absolutely correct. The analogy I often use is the newsroom. Anybody who’s curious and reasonably intelligent can be a good journalist, but you do need some orientation and guidance. Just like a newsroom couldn’t invite in 100 random people off the street and expect them to make an immediate high-quality contribution, neither can Wikipedia expect that."

What most of these populist programs tend to miss are the percolations that are felt elsewhere. For languages with large amounts of published localized content, filtered through long periods of (mostly) manual validation, experiments on ancillary components introduce inconsistency and, worse, errors. For instance, non-validated translations in add-on components ruin the user interface of the main component, which in most cases is an extremely prominent application and often part of enterprise-level products. These errors can be resolved through the usual bug tracking systems, but how does one chase up volunteers who turned up for localization sprints and have since moved on?

Crowdsourcing is here to stay. So are crowdsourced contributions. With more flexibility in translation tools, the new-age translators do not have to go through the rigorous grooming process that was prevalent until a few years back and that shaped a lot of the veteran translators. They can get their contributions into the main projects without any delay, often with the blessings of the sponsoring projects, which do not have to wait for their translation assets to multiply and their local communities to expand. With some amount of experience both as a translator and as a homemaker, the one thing that I can vouch for is that technical translation is not unlike housework – everyone has an opinion on how easy it is, but you don’t know how many corners you end up cleaning until you are down on your knees doing it.

Indic Typing Booster – Bengali

My colleagues Pravin Satpute and Anish Patil have been working for some time on a cool tool called the Indic Typing Booster. The premise of this tool is to aid users new to typing in Indian languages. Using a normal US English keyboard (i.e. the widely available generic keyboard around here), users begin typing a word in a keyboard sequence of their choice, and after a couple of key presses the typing booster prompts the user with a series of words that match the keys typed so far.

For instance, if the user wanted to type the word ‘कोमल’ (pronounced: komal) in a phonetic keyboard sequence that maps क to k and ो to o, they could start by pressing ‘k’ and ‘o’ and lo and behold (no, not Baba Yaga, but) a drop-down menu opens up with possible words starting with ‘को’. From this list the user may then choose one to complete the word they had intended to type. A backend database of words feeds this list. Each language gets a database of its own, compiled from available text in that language. Users can add new words to the list as well.
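In essence the booster does a prefix lookup against a per-language word list. Here is a toy sketch of that idea in Python – the keymap fragment and word list are made up, and the real ibus-based implementation is of course far more involved:

    # Toy sketch of the idea behind the typing booster: map the keys typed so
    # far to a prefix through a keymap, then suggest words from a per-language
    # word list. Both the keymap fragment and the word list are made up.
    KEYMAP = {"k": "क", "o": "ो", "m": "म", "l": "ल"}   # fragment of a phonetic layout
    WORDS = ["कोमल", "कोयल", "कोशिश", "कमल"]            # tiny stand-in for the backend database

    def suggestions(typed_keys, limit=5):
        prefix = "".join(KEYMAP.get(key, key) for key in typed_keys)
        return [word for word in WORDS if word.startswith(prefix)][:limit]

    print(suggestions("ko"))  # words beginning with 'को', e.g. कोमल, कोयल, कोशिश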

The typing booster requires the IBus input method framework to be installed on the system. The other packages necessary to get the Indic Typing Booster working are:

  • ibus-indic-table
  • <language-name>-typing-booster-<keymap-name> (i.e. for Bengali Probhat you would be looking for the bengali-typing-booster-probhat package)

If you are using Fedora, then all these packages can be easily installed with yum. If you are not, then the necessary information for download and installation is available at the Project Home page: https://fedorahosted.org/indic-typing-booster

Besides removing the need to hunt for the appropriate keys while manoeuvring through the inherent complications of Indic text, the typing booster could evolve into the much-needed solution for Indic typing on tablets and smartphones.

After Marathi, Gujarati and Hindi, the Indic Typing Booster is now available for Bengali (yay!). The Bengali database is by far the biggest store so far, thanks to the hunspell list that was created through an earlier effort of Ankur. Pravin announces the new release here.

This is what it looks like.

So to write কিংকর্ত্যবিমূঢ়, I could either type r/f/ZbimwX or just press 4 to complete it.

Do please give the Indic Typing Booster a go, and if you’d like to contribute then head over to the mailing list – indic-typing-booster-devel AT lists.fedorahosted.org – or the IRC channel – #typing-booster (FreeNode).

Traditions And Technology – Talk transcript from KDE India conference

The following is the transcript of my talk at the KDE India conference.

Hello, my name is Runa and I have been working as a Localization Specialist for the past 8-9 years. I often get asked why I do what I do, and most of the time I end up saying “I am making way for choices”. Well… don’t we all love choices – candies, clothes, books, shoes, gadgets, food. Anything, as long as compulsion can be repelled. Choice equates to Liberation.

However, there are times when compulsions emerge as a necessary evil that one has to live with. Sometimes they are not really evils, but things outside the known circuit that it is natural to resist. The journey of time and events in the history of humankind has enough examples of ordinary lives being irreversibly changed. Thankfully most of us in this room have not had to live through the great wars, or even perhaps the most important historical event of our country – the turmoil of independence, partition and the migration of populations. Nevertheless we have lived through events which are significant in the current times – the dotcom bust, global economic turmoil, unemployment, inflation, population growth and, most importantly, the rise of information-based technology and remote communication.

During my schooldays, when dinosaurs roamed the earth, people socialized over tea parties. Very few homes had a telephone. Mobile phones were unheard of and telex machines were the most advanced technology used in offices. During college years it was still a miracle if one could actually have an email id of one’s own. And even then it was just one of the fancy things to play around with… nothing mainstream. And this is probably not just my story; quite a number of us would vouch for a similar chain of events. When we add all these stories up, we actually see a much larger, interdependent impact. It has affected people of multiple generations, multiple academic backgrounds, professions et al.

The wheel changed human civilization, the steam engine brought in the industrial revolution, and steel ushered in the contemporary industrial advancement for economic development. And advancements in remote communication methods brought in what we know as globalization. All of these have influenced the lifestyles of human civilization, in varying degrees and at varying speeds of course.

Socio-economic factors have always left a significant impact. We live in bigger, wider cities, and travel longer and farther. The houses we live in don’t just need to be protected against earthquakes and fire and equipped with special corners for refrigerators and washing machines. They need to provide sockets for network ports, a place where the computer table will go, and electrical sockets for the mobile charging ‘stations’ – considering there will be quite a few in every household, and also for visitors. Well, these are signs of a natural evolution over time. But this time, the evolution was so fast and hurried that it seemed to have happened overnight.

Modern communication methods seem to have wrapped us in an invisible web. There are probably multiple studies in various business schools about the expansion of the mobile phone consumer market and the impact it created. As a general observer of modern Indian society, I see it happening in two ways – compulsion exerted by one segment of society, and the eventual cost/benefit ratio. This is how I’d explain the first factor: there was an initial group of users who had the means to use the technology and also saw some merit in using it. But their real-life communication was never limited to groups of similar means… it included many other people who were then compelled to adopt the same technology out of necessity. With subsequent affordable plans in place, it did not take long for this chain reaction to continue. Now we have individual service providers like cab drivers, grocers, milkmen and others who have to deal with consumers armed with mobile phones, making use of this technology not just to expand their businesses but to survive!

Similarly, the analogy extends to web- and internet-based communication methods, where the visual medium plays an advantageous role. Imagine a scenario like this one. You have moved into a new house and you are getting some furniture customized for your rooms. However, being an extremely busy person with very limited time, it has become impossible for you to go and check the designs at the carpenter’s workshop. As a result your furniture is not complete and the impasse continues. This situation could have been easily avoided if your carpenter and you could have connected in a way where neither party needed to be physically present – for example, if the designs were sent to you over email and you and your carpenter then worked out the details over the phone. This scenario is not unusual now, but it was until a few years back. It is not that all carpenters have equipped themselves to communicate with their consumers this way. But the ones who haven’t do risk losing their business, because their consumer base is changing. The younger, technology-charged generation in these households is coming of age, and they are the ones who will be doing the major portion of the spending.

However, this is from a business perspective. Even homes have seen a significant change in lifestyle. With industries and employment opportunities being created around information-based technology, and resultant industries like real estate, hospitality, banking, retail etc. expanding around it, a large group of the employable population is seeking employment in the cities that are considered the hubs. Children are leaving home much more than in earlier generations, and with communication no longer restricted to a postcard or the odd trunk call like in earlier times, migrating is not seen as a major upheaval in the household. But what we do see is the way the older generation is being touched by these gadgets of modern communication. Stories about resistance to using them are what household urban legends are made of.

Well, we can go on for hours with similar stories and analogies, but I am assuming that most of us in this room have been through similar situations. Further emphasis is not needed. Cut to the bottom line: the fact that emerges is that THIS is the world we now live in and we are not going backwards. This is somewhat similar to how things are in the place where I live. I live in the city of Pune, and the first thing that anyone would notice when they land in the city is the huge number of crawling two-wheeled vehicles. Motorbikes and scooters jostle for space on the narrow roads. Most car owners have a bike or a scooter as well. The city, however, does not have much public transport. Now this is a city that was essentially a bicycle town. The bicycles graduated into scooters and motorbikes. With the city expanding horizontally and the population increasing in large numbers, the two-wheeler population also increased. Road space is at a premium in most places, with limited scope for expansion. As a result the jostling on the road continues, but we are not going back to the era of the bicycle. In both cases this is the foundation that everyone needs to adapt to.

Well, this is not necessarily a bad situation, but it certainly is largely an unprecedented one. The most important difference is that we have a reversal of traditional roles in knowledge dispersion. Traditionally, knowledge has been passed on in a linear format. The older generation learns, gathers experience and mentors the new generation. The new generation merges new knowledge into the pool, gathers further experience and passes it on to the next generation. This is how workplaces and households have functioned for ages. Now the linear chain is broken and the information flow has shifted to start from the newer generation and work its way upwards. The older generation is faced with the task of putting aside their traditional tools and learning how to function with the new. Many of us have probably encountered this in our homes, where our parents, uncles, aunts and other elders have suddenly been confronted with a computer at their office desks and have been made to take mandatory lessons to learn to work with it.

If, like me, you were born sometime in the late 1970s or 1980s in India, then you would be somewhere in the middle of all this turmoil. We have seen the world before the communication revolution happened, and, being young learners, have also adapted rather fast to the change. Children born after the mid-1990s have woken up to a wired world and do not feel lost when confronted with a desktop computer or a smartphone.

Given the comparatively modern nature of the technologies being used, the workforce engaged in their development, proliferation and maintenance is the younger generation. What this group of creators brings in is an infusion of modern culture, their language and fusions, and their terminology. Most importantly, their slang – by which I mean the term as used in linguistics, i.e. colloquial terms – and local flavours. Local flavours, primarily because of the global nature of the workforce and their consumers.

The infusion of local flavours, whether in language or culture, into a basic substance is not a new phenomenon. It has been widely used in areas like advertising, where the same films and prints have been reused with dubbed scripts or redone to be more identifiable with the target audience. Even television shows (like Idol, Dancing with the Stars, Who Wants to Be a Millionaire etc.) have been redone to suit viewer tastes. When it comes to modern technology, it is assumed to be based upon English, which is not completely untrue. Compounded by the second assumption of a very high learning curve, this adds to a fear of the unknown for a large group of the population. The second reason is understandable and can be a cause for worry, although it is not insurmountable; it’s the first one that poses a very interesting problem.

When we considered the userbase earlier, I had mentioned that the role reversal had caused the older generation to assume the role of learners. Without prejudice, I’d like to state here from personal experience that I do sympathise with them, having come to terms with the fact that learning new things – whether it’s a language or a skill like driving – does become comparatively difficult after one hits a certain age. That may differ for individuals, so I would not put a number to it.

A good percentage of this group of people have not been academically trained to use English as their primary language of communication. They may have picked up the skills from frequently using it at their workplace or in other areas while communicating with this very large group of multilingual people otherwise known as their countrymen. Perhaps this is also the legacy that we will continue by the name of ‘Indian English’, and we already have several dialects of it in various corners of the country. This group also includes a large number of the present generation who have adapted the spoken and written forms of English for use alongside the multiple other languages that they speak. Because in India children grow up learning at least 2 to 3 languages. Here I’d like to mention a student who is hosted with my family back home in Kolkata. This young lad comes from a small town outside Kolkata and is an undergraduate student at a city college. Ever since the local cable operators started providing the Nat Geo channel with Bengali dubbing, it has been a trying task to get him away from the television. For a couple of hours every night he is engrossed in a world that he probably did not know existed earlier. And for once no one is really complaining about spending too much time in front of the telly.

Anyway, back to our demographics. There is another group of people who have learnt to use a version of English that does not include the contemporary flavour that has evolved out of primarily US American and even Australian slang – yet again I use the word in its linguistic sense. Hence for these people, ‘default’ translates into ‘breaking a rule’, and the word they would have used to mean ‘standard value’ would be ‘de facto’. There are more such examples where conflicts exist.

Another group of people are the vocationally trained service providers. This is the group where our neighbourhood carpenter comes into the picture. They are highly skilled people but we might be stretching our expectations a tad too much, if we assume that they would be able to parse l33t if they encounter it on the interface of a desktop application. With repeated usage – perhaps yes. However, the substance may not be interpreted with full appreciation.

As part of localizing desktop applications I have been reading user interface messages from a very large number of applications, including office suites, file browsers, web browsers, chat clients, games etc. Now one of the fun things that people like to do with their workspaces and gadgets is decorating them – like setting a nice wallpaper or ringtone. Wallpaper names and colours often pose a very serious problem. These are steeped in cultural connotations. Buildings, landmarks, plants, flowers, fruits, food, fauna, dances, musical instruments, scenes from festivals and sports are very often used as images. Same with colours. I had a very tough time some years back when trying to translate the names of some colours named after speciality wines from various regions of France. Another example that I can recall is various shades of black and white named after the stages of a snowstorm. Translating that for the audience of a tropical country is a serious challenge. In this regard, I really have to mention – and I can’t say this enough – that the user interface messages of the Mozilla Firefox browser are perhaps one of the best examples of extremely culture-neutral UI messages. They are precise, and convey the matter with extreme brevity without being flowery.

Anyway, going back to the part about tailoring products according to local requirements, what we have seen so far is an attempt to create localized versions based upon language rather than identifiable cultural parallels. One of the basic methodologies when it comes to helping people adapt to a new technique or technology is to create a familiar environment for them to play around with. Familiarity creates a comfortable footing for further exploration. Choices and options bring flexibility to the process. With additional handholding it gets better and easier to learn. And what better way to create a comfortable learning environment than to use local and time-tested analogies from the mainstream.

This is where a good number of us get to play a part. As creators, either primary or secondary – i.e. people like the localizers, like me. In the first place we need to settle the question of why we need localized or localization-ready applications. Right from education to social networking, digital libraries, information gathering like the ongoing census, GIS and disaster management, the userbase across the globe is going one way – up. We need production-ready pieces of software to be provided to users who may not be in a position to learn them at leisure, or who may have to learn them before they are repelled enough to give up. In such cases, the uptake can be facilitated on multiple fronts of convenient choice, including a choice of languages and/or script familiarity.

Also in the process, power users can be converted into creators. With a better understanding of the operational domain, they can interpret the functional aspects of the applications with more precision. In domains like disaster management, where local inhabitants of the affected areas are deployed, the choice of language plays an important part in letting participants work seamlessly so that the operation runs without hindrance.

Let's take the example of GPS devices, which are being increasingly used by people who drive around on highways and within cities. In earlier days, one just had to pull down the window and holler at a passerby to ask for directions, and if one had ventured into an area where the local language was unfamiliar, a local driver or guide could be hired to do away with that problem. This option has not really vanished, but for all practical purposes people do not mind having added features in their GPS devices that bridge this gap as well and ensure they are never really stuck without an option. Localization is often referred to as a low-hanging fruit for entrants into the Open Source Software world. Maybe it is. The way I see it, the lower it hangs, the more mouths it feeds.

The primary aim of all application developers is to have people use the products they create. Some probably start their projects to scratch a nagging itch of a problem they encounter; most open source projects gather people along the way, from nearly every possible corner of the world. While we here in India ponder over what a 'Bordeaux' colour of wallpaper would look like, someone in South Africa is probably posting on WordPress using the blogging application with a Bengali title – Lekhonee. With an open gate for new entrants to contribute to every possible aspect of creating the tools of the modern and wired world, each one of us has the potential to bring the learnings from our individual cultures and homes to better equip these tools and resources.

There have always been passionate people who do not hesitate to dive headlong into what they believe will take their ideals and beliefs forward. I remember the blogger greatbong writing about the 10-paisa poet, who came from a district town in West Bengal to the Kolkata Book Fair and used to roam all over the fair grounds reciting his poems. During those 10 days the man lived on the pavements – all for the love of his creations. In most cases passion emerges during adversity and thrives in stability. Our positions are enviable in many ways. We live in a free culture where social mouthpieces like microblogging can create a direct impact; one does not have to wait for newspapers or politicians to take up cudgels on one's behalf. Armed with academic qualifications, a better understanding of modern technology and global culture, and the freedom to create, we hold a key to adopt the roles of makers and set a direction for society. How we do it, though, is something each of us has to discover.

Lastly, I'd like to conclude with this message in memory of a fellow member of the Pune Linux User's Group and Blender artist Zoyd aka Vinay Paway, who passed away last year in an unfortunate road accident. He had worked on the movie Sintel, and a couple of days before he passed away he was working on getting the English dialogues in the film translated into various Indian languages. We could only finish the subtitling work in Bengali and Hindi. More language translations have been pledged in his memory – please do come forward if you'd like to join in and complete them.

Ra-Jhaphala in Qt Applications

While writing text in many Indian languages we encounter composite characters composed of various combinations of more than one consonant and/or dependent vowel sign. Generally, these are written in one of two ways (a small code point sketch follows the list):

1. Consonant + Joiner + Consonant (+ Dependent Vowel Sign)
2. Consonant + Dependent Vowel Sign (which determines what vowel sound is used to pronounce the consonant)
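
To make those two patterns concrete, here is a minimal Python 3 sketch of my own (assuming that 'Joiner' above refers to the Bengali hasanta/virama, U+09CD) that spells out the raw code points for one example of each:

    # Pattern 1: Consonant + Joiner (hasanta, U+09CD) + Consonant
    kta = "\u0995\u09CD\u09A4"   # ক + ্ + ত renders as the conjunct ক্ত
    # Pattern 2: Consonant + Dependent Vowel Sign
    ki = "\u0995\u09BF"          # ক + ি renders as কি
    for word in (kta, ki):
        print(word, [f"U+{ord(ch):04X}" for ch in word])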

However, there are exceptions where a straight implementation of the writing rules cannot be used for text input in an i18n-ized application. An example is the curious case of the two letters র (aka Ra, Unicode: U+09B0) and য (aka Ya, Unicode: U+09AF). These two consonants allow two different composite characters to be written from the same sequence of usage[1].

Sequence 1:

র্য = To write words like আর্য (pronounced 'Ar-j-ya'; the 'j' is an exception in pronunciation practised in Bengali)

Sequence 2:

র‍্য = To write words like র‍্যান্ডম (i.e. the transliterated version of the word 'random', which is pronounced 'rya-n-dom' and hence has to be transliterated appropriately)

In both the above cases, র and য need to combine in the same sequence, so the simple method of writing them as র + joiner + য cannot serve both. Due to its higher frequency of usage in Bengali words, this plain combination has been assigned to Sequence 1. For Sequence 2, an additional character, ZWNJ (U+200C), had to be used. However, since Unicode 5.0 this has changed: instead of ZWNJ, ZWJ (U+200D) is to be used to write Sequence 2.

” …Unicode Standard adopts the convention of placing the character U+200D ZWJ immediately after the ra to obtain the ra-yaphaala…”

– from the Unicode 5.0 book, pg. 316 (afaik the online version is not available)
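
In terms of raw code points, the two sequences differ only by the ZWJ placed immediately after the র, exactly as the quote above describes. A small Python 3 sketch of my own to make the difference visible:

    # Sequence 1: র + hasanta + য -> র্য (as in আর্য)
    seq1 = "\u09B0\u09CD\u09AF"
    # Sequence 2: র + ZWJ (U+200D) + hasanta + য -> র‍্য (as in র‍্যান্ডম)
    seq2 = "\u09B0\u200D\u09CD\u09AF"
    for s in (seq1, seq2):
        print(s, [f"U+{ord(ch):04X}" for ch in s])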

The next challenge was to ensure that this sequence was rendered correctly when used in a document. While it was displayed correctly by Pango, ICU and Uniscribe, Qt broke badly [bug links: KDE Bugs, Qt, Fedora/Red Hat Bugzilla]. After much prolonged contemplation, Pravin managed to push in a patch to fix this issue in HarfBuzz, which will also make it into Qt. This fixes the rendering issue.

The review discussion for this patch (which is also expected to resolve a few other issues) is happening here. However, the delay in updating the much-outdated entry in the Unicode FAQ led to a lot of confusion about whether the usage of U+200C had indeed been discontinued in favour of U+200D. This needs prompt action on the part of whoever maintains that FAQ. (Sayamindu had also mentioned it in his blog earlier.)

[1] The same two consonants, used in the same order, can produce two different composite characters depending on the joining characters placed between them.


The other major issue underway in the same review discussion is about allowing the input of multiple split dependent vowel signs as a valid alternative to a single dependent vowel.

E.g. ক (U+0995) + ে (U+09C7) + া (U+09BE) to be allowed as an alternative input sequence for ক (U+0995) + ো (U+09CB)

The Devanagari equivalent would be:

क (U+0915) + े (U+0947) + ा (U+093E) to be allowed as an alternative input sequence for क (U+0915) + ो (U+094B)

In general practice, when a dependent vowel is written after a consonant it completes the composite character; multiple dependent vowels are not allowed for one single consonant. While the pictorial representation in the above example may look the same, in reality the split vowel sequence may lead to incorrect rendering across applications (and in future for URLs as well) if the code points are stored as such. In applications using Pango, the second vowel input is displayed to the user as an unattached vowel sign with a dotted circle, which automatically warns the user about an invalid sequence entry.
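
One way an application could guard against storing the split form – a sketch of my own in Python 3, not how Qt or Uniscribe actually handle it – is to fold the split sequence into the single vowel sign before the text is saved:

    # Hypothetical helper: fold split dependent-vowel sequences into the
    # single vowel sign code point before storage.
    SPLIT_TO_SINGLE = {
        "\u09C7\u09BE": "\u09CB",  # Bengali: ে + া -> ো (matches the canonical decomposition of ো)
        "\u0947\u093E": "\u094B",  # Devanagari: े + ा -> ो (the visual parallel above; not a canonical Unicode equivalence as far as I know)
    }

    def fold_split_vowels(text):
        for split, single in SPLIT_TO_SINGLE.items():
            text = text.replace(split, single)
        return text

    # ক + ে + া is stored as ক + ো
    assert fold_split_vowels("\u0995\u09C7\u09BE") == "\u0995\u09CB"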

Since Qt (and, it looks like, Uniscribe too) allows this practice, perhaps a specification is floating around somewhere about how the conversion and storage of such input sequences is handled. Any pointers to this would be very helpful. At present I am keeping an eye on the review discussion, and hopefully the issues will be resolved so that a uniform standard persists across all platforms.