July 24, 2013

Hum bolta ko bolta bolta hain…

Posted in Work & Technology at 6:11 pm by runa

The findings of the recently conducted survey of the languages of India, under the aegis of the People’s Linguistic Survey of India, have been the talking point for the past few days. The survey results are yet to be published in their entirety, but parts of them have been released through the mainstream media. The numbers about the scheduled and non-scheduled languages, scripts and speakers are fascinating.

Besides the statistics from the census, this independent survey has identified languages which are spoken in remote corners of the country and by as few as 4 people. From some of the reports[1][2] that have been published, what one can gather is that there are ~780 languages and ~66 scripts presently in use in India. Of these, the North Eastern states of India have the largest per capita density of languages and contribute more than 100 (closer to ~200 if one sums things up). It has also been reported that in the last 50 years ~250 languages have been lost, which I am assuming means that no speakers of these languages remain.

This and some other things have led to a few conversations around the elements of language diversity that creep into everyday Indian life. Things that we take as normal, yet are so diametrically different from monolingual cultures. To demonstrate, we picked names of acquaintances/friends/co-workers and put 2 or more of them together to find a common language for each group. In quite a few cases we had to settle that English was the only language a group of randomly picked people could converse in. Well, if one has been born in (mostly) urban India any time from the 1970s onwards (or maybe even earlier), this wouldn’t be much of a surprise. The bigger cities have various degrees of cosmopolitan pockets. From a young age people are dragged through these either as part of their own social circle (like school) or their parents’. Depending upon the location and social circumstances English is often the first choice.

When at age 10 I had to change schools for the very first time, I came home open-mouthed and narrated to my mother that in the new school the children speak to each other in Bengali! Until that time, Bengali was the exotic language that was only spoken at home and was heard very infrequently on the telly on Sunday afternoons. The conservative convent school where I went was a melting pot of cultures with students from local North East Indian tribes, Nepalis (both from India and Nepal), Tibetans, Chinese, Bhutanese and Indians from all possible regions where Government and Armed Forces personnel are recruited. Even the kid next door who went to the same school spoke in English with me at school and in Bengali at the playground in the evening.

The alternative would be the pidgin that people have to practice out of necessity. Like me and the vegetable vendor in the Sunday market. I don’t know her language fluently enough to speak it (especially due to the variation in dialect), she probably hasn’t even heard of mine, and we both speak laughable Hindi. What we use is part Hindi and part Marathi and a lot of hand movements to transact business. I do not know what I would do if I was living further south where Hindi is spoken much less. But it would be fun to try out how that works.

An insanely popular comic strip has been running for the past year – Guddu ang Gang, by Garbage Bin Studios. The stories are a throwback to our growing up years from the late 80s and 90s and touch so many chords on a personal level. The conversations are in Hindi, but the script they use is English. Like so many thousands of other people I have been following it and even purchased the book that came out. But maybe it wouldn’t have been the same amount of fun if the script was in Devanagari. I don’t read it fast enough. And no, in this case translating the text won’t make any sense. There is Chacha Chaudhary for that. Or even Tintin comics. Thanks to Anandamela, most people my age have grown up reading Tintin and Aranyadeb (The Phantom) comics in Bengali. There also exist juicy versions of Captain Haddock’s abuses.

Last year I gave a talk at Akademy touching on some of these aspects of living in a multi-cultural environment. TL;DR version: the necessities that require people to embrace so many languages – either for sheer existence or for the fringes – and how we can build optimized software and technical content. For me, it’s still an area of curiosity and learning. Especially the balance between practical needs and cultural preservation.

** Note about the title: bolta – Hindi:’saying’, Bengali:’wasp’. Go figure!

February 16, 2013

Building a Standardized Colour Reference Set

Posted in planetarium, Work & Technology at 2:23 pm by runa

Among the various conversations that happened over the last week, one that caught my attention was about having a standardized list of colours for translation glossaries. This has been on my mind for a long time and I have often broached the subject in various talks. Besides the obvious primary hurdle of figuring out how best to create this list, what I consider of more importance is the way such a reference set ought to be presented. For word-based terminology, a standard mapping like:

key terms -> translated key terms (with context information, strongly recommended)

is easy to adopt.

However, for colours this becomes difficult to execute, for one very important reason. Colours have names which have been made to sound interesting with cultural or local references, nature (again maybe local or widespread), popular themes or general creativity. This makes them hard to translate. To translate colour names like ‘Salmon’ or ‘Bordeaux’ or <colour-of-sea-caused-by-local-mineral-in-the-water-no-one-outside-an-island-in-the-pacific-has-heard-of> one has to be able to understand what they refer to, which may be hard if one has never come across the fish or the wine or the water. To work around that, I have been using a 2-step method for a while, which is probably how everyone does it anyway (but never really talks about):

colour-name -> check actual colour -> create new name (unless of course it’s a basic colour like Red)

So, a natural progression in the direction of standardizing this would involve having the actual colour squeezed in somewhere as the context. Something on the lines of:

Colour Name -> Sample of the colour -> Colour Translation

like:

Salmon         salmon          ইঁট

It would be good to have something like this set up for not just translation, but general reference.
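As a rough illustration of the Colour Name -> Sample -> Translation idea, here is a minimal Python sketch of what one entry in such a reference set might look like. The class name `ColourEntry`, its fields, the `bn-IN` locale key and the use of a hex value as the "sample" are all assumptions made for this example (the hex value is the standard CSS value for the named colour Salmon); it is not an existing glossary format.

```python
from dataclasses import dataclass, field


@dataclass
class ColourEntry:
    """One row of a hypothetical colour reference set."""
    name: str          # source-language colour name
    sample_hex: str    # the actual colour, carried along as context
    translations: dict = field(default_factory=dict)  # locale -> translated name

    def sample_rgb(self):
        """Return the sample as an (R, G, B) tuple of integers."""
        value = self.sample_hex.lstrip("#")
        return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))


# Example entry: the translation is the one shown in the post above.
salmon = ColourEntry("Salmon", "#FA8072", {"bn-IN": "ইঁট"})
print(salmon.name, salmon.sample_rgb(), salmon.translations["bn-IN"])
```

Keeping the sample next to the name means a translator, or a tool rendering the glossary, can always show the actual colour alongside the word.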

December 31, 2012

Translation Sprint for Gaia

Posted in planetarium, Work & Technology at 5:21 pm by runa

Last Saturday, i.e. 29th December 2012, we had a translation sprint for Ankur India with a specific focus on Gaia localization. Over the last few weeks some volunteers had introduced themselves to participate in translation and localization. Firefox OS seemed to be a popular project with them, and it was also easy to translate. However, the reigning confusion about the tool of choice was not easy to work around. The new translators were given links to the files they could translate and send over to the mailing list/mentor for review. Going back and forth in the review process was taking time, so we quickly agreed on the mailing list on a date for a translation sprint. We used the IRC channel #ankur.org.in and gathered there from 11 in the morning to 4 in the evening. The initial hour was spent setting up the repository and deciding how we were going to manage the tasks between ourselves. Two of us had commit rights on the Mozilla mercurial repository. Of the 5 translators, two were very new to translation work, so it was essential to help them with constant reviews. By the end of the second hour, we were string crunching fast and hard; translators were announcing which modules they were picking (after some initial overlooking of this, prolly due to all the excitement) and then pushing them into the mercurial repository. We shut shop at the closing time, but had a clear process in place which allowed people to continue their work and continue the communication over email. All it needed was an IRC channel and a fundamental understanding of the content translation and delivery cycle.

SUMMARY

Participants:

  • Biraj Karmakar (biraj),
  • Priyanka Nag (priyanka_nag),
  • Runa Bhattacharjee (arrbee/runa_b)
  • Samrat Bhattacharya (samratb),
  • Sayak Sarkar (sayak)

Translation Statistics:

  • At Mozilla Dashboard – 39% translated. (Does not include the files still being reviewed)

What worked:

  • Communication was live
  • Faster turnaround of translation -> reviews -> revision
  • Queries were resolved faster
  • Commits were immediately made into the repository
  • Workflow was established to ensure the committers were being notified of files ready to go into the repository
  • No overlapping of translation

What could have been nicer:

  • A simpler tool to track the translation, through *one* interface. (Discussed many times earlier, and comments can be directed to the earlier post)
  • Pre-decided work assignments to start things off (this was rather hastily put up)
  • More time

Follow up:

  • There is still more to do and the translation has to continue. Not just for Gaia, but for other projects as well.
  • A review session for all the translated content. Besides catching errors and omissions of various nature, this can be of particular benefit to the new translators who can gauge the onscreen context of the content that they had to blindly translate

November 17, 2012

Mozilla L10n – Discussion Notes

Posted in planetarium, Work & Technology at 7:24 pm by runa

During the recently held Language Summit at Pune, we got an opportunity to discuss a long-standing issue related to the localization process. Several discussions over various media have been happening constantly for the past couple of years, and yet clarity on the dynamics was sorely missing. A few months back a generic bug was also filed, which helped collate the points of these dispersed discussions.

Last week, we had Arky from Mozilla with us, who helped us get an insight into how things currently stand on the Mozilla localization front. Old hats like me, who have been working on the localization of Mozilla products for a long time (for instance, I had started sometime around Firefox 1.x), had been initiated and trained to use the elaborate method of translation submissions using file comparison in the version control system. During each Firefox release, besides the core component there are also ancillary components like web-parts that need to be translated on other version control systems or through bugs. Thankfully there is now the shipping dashboard that lists some of these bits at one url.

However, recently there have been quite a few announcements from various quarters about Mozilla products being made available for translation through several hosts/tools – verbatim, narro, locamotion, even on transifex. Translators could gather files and submit translations via these tools, yet none of them deprecated the earlier method of direct submission into the servers through the version control systems. The matter was compounded by a spate of translations coming in from new translators who were being familiarized with translation work at various local camps as part of Mozilla’s community outreach programs.

During the above mentioned session, we sought to find some clarity on this matter and also to understand the future plans that are being undertaken to reconcile the situation. Firstly, we created a list of all the tools and translation processes that are presently active.

 

1. Direct Submission into Mercurial or SVN

2. https://localize.mozilla.org/ – Aka ‘Verbatim’ is essentially a version of pootle running on a server hosted by Mozilla. Used to translate web-parts, snippets, SUMO content etc.

3. mozilla.locamotion.org – Hosted by the http://translate.sourceforge.net group, and runs an advanced version of pootle. Used to translate Firefox, main.lang etc.

4. Narro – The Narro tool that allows translations of Firefox, components of Thunderbird, Gaia etc

5. Pontoon Project – To localize web content. More details from the developers here.

6. Transifex – Primarily gaia project

7. Babelzilla – Mozilla Plug-ins

There could be more beyond the above.

The reason that was given for the existence of all these tools is to allow translators to choose a tool that they were ‘comfortable with’. This however gives rise to quite a few complications: syncing between these tools, which evidently provide duplicate platforms for some of the projects, and maintaining a trace of translations by the translation coordinators. Especially when direct submission into the VCS is still pretty much an open option for translators (coordinators) who may not be aware of a parallel translation group working on the same project on another translation platform.

A new project called ELMO is aimed at rectifying this situation. This would host the top-level URI of the Mozilla Localization project, with direct links to each language’s home page. The home page intends to list the translation team details and urls for the projects. However, there is one big difference that seemed apparent: unlike other translation projects which provide one umbrella translation team, each of the Mozilla products can have different translation teams and coordinators, independent of each other. It may be a scalable solution for manpower management, but leaves a big chance of terminology and translation going out of sync across products. However, it may be a good idea to wait and see how Elmo fixes the situation.

Meanwhile there were a few action items that were fixed during the discussion (Thanks Arky!), these were:

1. A page on the mozilla wiki listing *all* the translation tools/hosts that are active and the projects that they host

2. Follow up on the discussion bug for “Process Modification”

3. A request to have automatic merging of strings modified in source content into the l10n modules in Mercurial (i.e. the strings identified via compare-locale). For instance, the comparison between the bn-IN and en-US module for the Aurora branch can be found here (cron output).

4. Explore the possibility to identify a consolidated Project calendar for all the Mozilla l10n projects. (Reference comment here)
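To make the kind of comparison mentioned in item 3 a little more concrete, here is a small Python sketch that finds strings present in a source file but missing from a localized one, and vice versa. It is only an illustration under simplifying assumptions: it is not the Mozilla tool itself, it handles only plain `key = value` lines, and the function names and sample strings are made up for this example.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            entries[key.strip()] = value.strip()
    return entries


def compare_strings(source_text, localized_text):
    """Return (missing, obsolete) key lists relative to the source."""
    source = parse_properties(source_text)
    localized = parse_properties(localized_text)
    missing = sorted(k for k in source if k not in localized)
    obsolete = sorted(k for k in localized if k not in source)
    return missing, obsolete


# Made-up sample content standing in for an en-US and a bn-IN module.
en_us = "open.label = Open\nsave.label = Save\nprint.label = Print\n"
bn_in = "open.label = খুলুন\nold.label = পুরোনো\n"
missing, obsolete = compare_strings(en_us, bn_in)
print("missing:", missing)
print("obsolete:", obsolete)
```

An automatic merge step would then add the missing keys (untranslated) to the localized module and drop or flag the obsolete ones.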

As Arky mentioned during the discussion, there were plans that were already underway for implementation, and I am quite excited to wait and see how things go. Some blogs or updates from the Mozilla L10n administration team would be really helpful and I hope those come in quick and fast.

Attendees:

Arky
Amir Aharoni
Sayak Sarkar
Ankit Gadgil
Ani Peter
Sweta Kothari
Jaswinder Singh
Rajesh Ranjan
Shankar Prasad
Nilamdyuti Goswami
Shantha Kumar
Manoj Giri
Krishnababu Krothrapalli
…(Please leave a comment if I missed your name)

July 10, 2012

Akademy 2012 Talk Transcript – Localizing software in Multi-cultural environments

Posted in Uncategorized, Work & Technology at 11:26 pm by runa

The following is my talk transcript from Akademy 2012. During the talk I wavered quite a bit from this script, but in the end I managed to cover most of the major bits that I wanted to talk about. Either way, this is the complete idea that I wanted to present and the discussions can continue at other places.


Good morning. The topic for my talk this morning is Localizing software in Multi-cultural environments. Before I start, I’d like to quickly introduce myself. For most of my talks, I include a slide at the very end with my contact details. But after the intense interactive sessions I forget to mention them. I did not want to make that mistake this time. My name is Runa and I am from India. This is the first time I am here in Estonia and at Akademy. For most of my professional life, I have been working on various things related to Localization of software and technical documentation. This includes translation, testing, internationalization, standardization, also on various tools and at times I try to reach out about the importance of localization and why we need to start caring for the way it is done. That is precisely why I am here today to talk about how localization can have hidden challenges and why it is important that we share knowledge and experience on how we can solve them.

Before I start the core part of the discussion today, I wanted to touch upon why localization of software is becoming far more important now. (I was listening to most of the talks in this room yesterday. And it was interesting to note that a continuous theme that reappeared in most of the talks was about finding ways to simplify adjusting to a world of growing devices and information). These days there is a much larger dependence on our devices for communication, basic commercial needs, travel etc. These could be our own devices or the ones at public spaces. It is often assumed to be an urban requirement, but with improvement in communication technology this is not particularly the case. Similarly, the other assumption is that the younger generation is more accustomed to the use of these devices, but again that is changing – out of compulsion or choice.

The other day I was watching this series on BBC about the London Underground. And there was this segment about how some older drivers who had been around for more than 40 years opted out of service and retired when some new trains were introduced and they did not feel at ease with the new system. Now I am not familiar with the consoles and cranks in the railway engines but for the devices and interfaces that we deal with, among other things localization is one major aspect that we can use to help make our interfaces easy. We owe it to the progress that we are imposing in our lives.

The reason I chose to bring this talk to this conference was primarily the fact that it was being held here, in Europe. In terms of linguistic and cultural diversity, India by itself perhaps has as many complexities as the entire continent of Europe put together. However, individual countries and cultural groups in Europe depict a very utopian localization scenario, which may or may not be entirely correct. I bring this utopian perspective here as a quest, which I am hoping will be answered during this session through our interactions. I’ll proceed now to describe the multi-cultural environment that I and most of my colleagues in India work in.

Multi-cultural structure:

Firstly, I’d like to state here that I do not use the term multi-cultural from any anthropological reference. Instead it is a geopolitical perspective. Present day India is divided into 28 states and 7 union territories, and the primary basis for this division is … well ‘languages’. I’d like to show you a very simple pictorial representation of how it essentially is at the ground level.

The big circle here is our country and the smaller ones are the states. Each of the states has a predominant population of the native language speakers. Some states may even have multiple state languages with equally well distributed population. At this point, I’d like to mention that India has 22 languages recognised by the Indian constitution for official purposes, with Hindi and English being considered the primary official languages. The latter being a legacy from the British era. The individual states have the freedom to choose the additional language or languages that they’d like to use for official purposes and most states do have a 3rd and sometimes 4th official language. So the chances are that if you land up at a place where Hindi is not the primary language of communication, you’d see the public signs written in a minimum of 3 languages. Going back to our picture, I have marked the people in each of these states in their distinctive colour. They have their own languages and their own regional cultures.

However, essentially that is not the status quo that is in place. So we have people moving away from their home states to other states. Why? Well, first for reasons of their employment in both government and private sector jobs. Education. Business. Defence personnel and various other common enough reasons. And given that it’s one country we are talking about, people have complete freedom to move about without additional complications of visas or residence permits. So in reality the picture is somewhat like this.

The other multi-cultural grouping is when languages cross geographical borders. Mostly due to colonial legacy or new world political divisions and migration, some languages exist in various places across the world.

Like Spanish or French and closer home for me, my mother tongue Bengali that is spoken in both India and Bangladesh. In these cases, the languages in use often take the regional flavours and create their own distinctive identity to be recognised as independent dialects – like Brazilian Portuguese. While sometimes they do stay true to their original format to a large extent – as practised by the Punjabi or the Tamil speaking diaspora.

While discussing the localization scenario, I’ll be focusing on the first kind of multi-cultural environment i.e. multiple languages bound together by geographical factors so that they are forced to provide some symmetry in their localized systems.

Besides the obvious complexities with the diversity, how exactly does this complicate matters on the software localization front? To fully understand that, we would need to first list the kind of interfaces that we are dealing with here.

In public spaces we have things like ATM machines, bank kiosks, railway and airline enquiry kiosks, ticketing machines, while on the individual front we have various applications on desktop computers, tablets, mobile phones, handheld devices, GPS systems etc. If you were here during Sebas’ talk yesterday afternoon, the opening slide had the line ‘devices are the new ocean’. Anyway, some of these applications are for personal use, while some others may be shared, for instance, in the workplace or in educational institutes. In each of these domains, when we encounter a one-to-many diversity ratio the first cookie that crumbles is standardization.

Language is one of the most fundamental personal habits that people grow up with. A close competitor is a home cooked meal. Both are equally personal and people do not for a moment consider the fact that there could be anything wrong with the way they have learnt it.

Going back to the standardization part, two sayings very easily summarize the situation here, one in Hindi and the other in Bengali:

1. do do kos mein zuban badal jaati hain i.e. the dialect in this land changes every 2 kos (about 25 miles)

2. ek desher buli onyo desher gaali i.e. harmless words in one language may be offensive when used in another language

So there is no one best way of translating that would work well for all the languages. The immediate question that would come to mind is, why is there a need to find a one size fits all solution?

These are independent languages after all. While it does work well independently to a large extent, there are situations where effective standardization is much in demand.

For instance, in domains which span across the diversity. Like national defence, census records, law enforcement and police records, national identity documents etc.
Complications arise not just in identifying the ideal terminology, but a terminology that can be quickly adapted to change. The major obstacle comes from the fact that a good portion of these technological advancements were introduced much before Indian languages were made internationalization-ready. As a result, the users have become familiar with the original English terminology in a lot of cases. There were also people who knew the terms indirectly, perhaps someone like a clerk in the office who did not handle a computer but regularly needed to collect printouts from the office printer. So when localized interfaces started to surface, besides the initial mirth, they caused quite a bit of hindrance in everyday work. Reasons ranged from non-recognition of terminology to sub-standard translations. So we often get asked the question: when English is an official language and does cut across all the internal boundaries, why do you need to localize at all? It is a justified query. Especially in a place like India, which inherited English from centuries of British rule. However, familiarity with a language is not synonymous with comfort. A good number of people in the work force or in the final user group have not learnt English as their primary language of communication. What they need is an interface that they can read faster and understand quickly to get their work done. In some cases, transliterated content has been known to work better than translated content.

The other critical factor comes from an inherited legacy. Before independence, India was dotted with princely states and kingdoms and autonomous regions. They often had their own currency and measurement systems, which attained regional recognition and made their way into the language of that region. A small example here.

In Bengali, the word used to denote a currency is Taka. So although the Indian currency is the Rupee, when it is to be denoted in Bengali the word Rupee is completely bypassed and Taka used instead. So 1 Rupee is called ek taka in Bengali. When we say Taka in Bengali in India, we mean the Indian Rupee (symbol ₹). But as an obvious choice, Taka (symbol ৳) has been adopted as the name for the currency of Bangladesh. So if a localized application related to finance or banking uses Taka as the term for currency, a second level of check needs to be done to understand which country’s currency is being talked about, and only then are the calculations to be done. To address issues of this nature we often have translations for the same language being segregated into geographical groups, mostly based on countries.
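The "second level of check" could look something like the following Python sketch, which decides what Taka means from the locale before any amount is formatted. The table, the function names and the choice of `bn-IN`/`bn-BD` locale tags with ISO 4217 codes `INR`/`BDT` are assumptions made for illustration, not how any particular banking application does it.

```python
# Hypothetical lookup: which currency the Bengali word "Taka" refers to.
TAKA_CURRENCY_BY_LOCALE = {
    "bn-IN": "INR",  # Bengali in India: Taka means the Indian Rupee
    "bn-BD": "BDT",  # Bengali in Bangladesh: Taka is the national currency
}

CURRENCY_SYMBOLS = {"INR": "₹", "BDT": "৳"}


def resolve_taka(locale):
    """Return the currency code that 'Taka' stands for in this locale."""
    try:
        return TAKA_CURRENCY_BY_LOCALE[locale]
    except KeyError:
        raise ValueError(f"cannot tell which currency 'Taka' means for {locale!r}")


def format_amount(amount, locale):
    """Format an amount with the symbol and code resolved from the locale."""
    code = resolve_taka(locale)
    return f"{CURRENCY_SYMBOLS[code]}{amount:.2f} ({code})"


print(format_amount(1, "bn-IN"))
print(format_amount(1, "bn-BD"))
```

Failing loudly for an unknown locale is deliberate here: guessing the currency is exactly the mistake the check is meant to prevent.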

Mono-cultural structure:

This is where I start describing the utopian dream. I do not use the word mono-cultural as an absolute, but would like to imply a predominantly mono-cultural environment. As opposed to the earlier complexity, a lot of places here in Europe are bound by the homogeneity of a predominant language and culture. Due to economic stability and self-sufficiency the language of the land has been the primary mode of communication, education, and administration. There does not arise a need for a foreign language to bind the people in various parts of the country. If you know your language, you can very well survive from childhood to old age. The introduction of new technology in localized versions completely bypassed any dependency on the initial uptake through English. Without the baggage of an inherited cross-cultural legacy, and bound through the commonality of a technology-integrated lifestyle, the terminology was stabilized much faster for adoption. So if you knew how to use an ATM machine in one city, you would most likely be able to use another one just the same in another city. That’s probably the primary reason why various applications are translated much faster in these languages with a much higher user base. Regional differences aside, a globally acknowledged version of the language is available and not difficult to understand.

How do we deal with the problems that we face in multi-cultural places:

The first thing would probably be to accept defeat about a homogeneous terminology. It would be impractical. But that doesn’t stop one from finding suitable workarounds and tools to deal with these complexities.

1. Collaboration on translations
2. Tools that facilitate collaboration
3. Simplify the source content
4. Tools for dynamic translation functionalities
5. Learn from case studies
6. Standardize on some fronts

Collaboration on translations – When translating, if you come across a term or phrase that you personally struggled to translate or think may pose a problem for other translators, it would be reasonable to leave messages on how to interpret them. Often highly technical terms or terms from a different culture/location are unknown or hard to relate to, and instead of all translators searching for the term individually, a comment from another translator serves as a ready reckoner. The information that can be passed in this way is: a description of the term or phrase, and how another language has translated it, so that other translators of the same language or of a closely related language can identify quickly how to translate it.

Tools that facilitate collaboration – To collaborate in this manner, translators often do not have any specific tools or formats to leave their comments. So when using open source tools, translators generally have to leave these messages as ‘comments’. Which may or may not be noticed by the next translators. Instead, it is beneficial if the translation tools allow for cross-referencing across other languages as a specialized feature. I believe the proprietary translation tools do possess such features for collaboration.
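As a small illustration of leaving such notes as ‘comments’, here is a Python sketch that renders a translation entry with translator notes attached. It assumes the gettext PO format, where lines beginning with `# ` are translator comments; the helper names `escape_po` and `po_entry` and the sample notes are made up for this example.

```python
def escape_po(text):
    """Escape backslashes and double quotes for a PO string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def po_entry(msgid, msgstr, notes=()):
    """Render one PO entry, with each note as a '# ' translator comment."""
    lines = [f"# {note}" for note in notes]
    lines.append(f'msgid "{escape_po(msgid)}"')
    lines.append(f'msgstr "{escape_po(msgstr)}"')
    return "\n".join(lines)


entry = po_entry(
    "Salmon",
    "ইঁট",
    notes=[
        "Colour name taken from the fish; check the actual shade first.",
        "bn-IN uses the name of a familiar colour instead of the fish.",
    ],
)
print(entry)
```

A tool with real cross-referencing would go further and surface these notes to translators of other languages, rather than relying on someone opening the same file.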

Simplify the source content – However, until such features are integrated or collaborative practices adopted, a quick win towards easier translation is to get back to the source content creators for explanations or requests to simplify their content. The original writers of the user interface messages try to leave their creative stamp on the applications. This may include cleverly composed words, simplified words, new usage of existing words, local geographic references, colloquial slang, analogies from an unrelated field or even newly created terms which do not have parallel representations in other languages. Marta had mentioned a similar thing yesterday during her talk – where she said that humour in commit messages should ideally be well understood by whoever is reading them. If taken as a creative pursuit, translators have the liberty to come up with their version of these creations. However, when we are looking at technical translations for quick deployments, the key factor is to make it functional. So while the translators can reach out to the content creators, the original content creators could also perhaps run a check before they write their content, to see if it will be easy to translate.

Tools for dynamic translation functionalities – Before coming here to Estonia, I had to read up some documents related to visas etc. which were not available in English. The easiest way to get a translated version of the text from German was through an online translation platform. Due to the complexity of Indian languages, automatic translation tools for them have not yet evolved to the same levels of accuracy that we otherwise see for European languages. But the availability of such tools would benefit societies like ours, where people do move around a lot. Going back to the earlier example of a ticket booking kiosk, let’s assume a person has had to move out of their home state and is not proficient in either of the two official languages or in the local language of the new state. In such a case, our users would benefit if the application on the kiosk has a feature so that interfaces for additional languages can be generated as per requirements, either from existing translated content or dynamically. This is for interface display. The other part is to allow simplified writing applications like phonetic and transliteration keyboards for writing complex scripts quickly.
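The kiosk idea can be sketched in a few lines of Python: given the languages a user can read, use an existing translation if there is one, otherwise ask for the interface to be generated on demand. The function name `choose_interface`, the `"existing"`/`"dynamic"` labels and the language codes in the example are assumptions for illustration; how the dynamic generation itself would work is left open.

```python
def choose_interface(preferred, translated):
    """Pick an interface language for a user.

    preferred  -- language codes the user can read, best first
    translated -- language codes the kiosk already has translations for
    Returns (language, source) where source is "existing" or "dynamic".
    """
    if not preferred:
        raise ValueError("at least one preferred language is required")
    for lang in preferred:
        if lang in translated:
            return lang, "existing"
    # Nothing ready: request the user's first choice to be generated on demand.
    return preferred[0], "dynamic"


# A kiosk that ships with Marathi, Hindi and English interfaces.
kiosk_translations = {"mr", "hi", "en"}
print(choose_interface(["bn", "hi"], kiosk_translations))
print(choose_interface(["bn"], kiosk_translations))
```

The second call is the case described above: a traveller who reads only Bengali at a kiosk with no Bengali interface, where the only way forward is to generate one.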

Standardize on some fronts – However, standardization is a key element that cannot be overlooked at all. As a start, terminology related to the basic functional areas where content is shared across languages need to be pre-defined so that there are no chances of discrepancy and even auto-translation functions can be quickly implemented.

Learn from case studies – And of course nothing beats learning from existing scenarios of a similar nature. For instance, a study on how perhaps the Spanish and Italian translation teams collaborated while working on some translations may be applied somewhat effectively to languages with close similarities like Hindi and Marathi.

Conclusion

Whether in a multi-cultural environment or otherwise, localization is here to stay. With the users of various applications growing every day, the need for customization and ease of use will grow with them. And like any other new technology, the importance lies in making the users confident in using them. There is nothing better to boost confidence than providing them with an interface that they can find their way around.

In Agustin’s keynote yesterday afternoon, he mentioned that there is a need for patience to instill confidence during these times of fast moving technology. At a discussion some time back, someone had suggested doing away with written content on the interface and retaining only icons. Realistically, written content can never be completely removed. But yes, it can be made easier to use. Sebas had shared a similar thought yesterday: that technology should be made functional for the users’ needs and not because it was fun developing it.

A few months back the Government of India sent out a circular to its administrative offices stating that, in place of difficult Hindi words, Hinglish or a mix of English and Hindi could be used to ease the uptake of the language. I wholeheartedly shared this view and had followed up with a blog post on this where I mentioned that:

  • Familiar terms should not be muddled up, and
  • Readability of the terms is not compromised,

primarily to ensure that terminology is not lost in translation when common issues are discussed across geographies, especially in the global culture of the present day that cuts across places like multinational business houses and institutes of higher education.

June 26, 2012

Akademy 2012

Posted in Work & Technology tagged , , , at 11:41 am by runa

Akademy – the annual summit of the KDE community is happening in Tallinn, Estonia this year. It’ll be the first time I’ll be attending this conference. The schedule for this 7 day summit has talks, sessions and workshops and what I am guessing will be a lot of exciting interactions. I’ll be presenting as well and my talk is about ‘Localizing Software for Multicultural Environment‘. It’s on the 1st of July and if you are a translator, write documents, develop software, use localized environments and are also attending Akademy, do please try to head to Room 1 that day. I am planning to have this session as a comparative study session for the most part, with me presenting about localization in a multi-cultural environment and gathering the perspective from non-multi-cultural translation groups. The talk transcript will be available here on my blog right after the talk. However, if there are any questions that you’d like me to address during the talk, please do let me know over email or through the comments.

Thanks to the Akademy team for the invitation and sponsorship. Looks like these will be days very well spent.

April 11, 2012

“The Sun Goes Around The Earth”

Posted in planetarium, Work & Technology tagged , , at 8:29 pm by runa

“THE SUN GOES AROUND THE EARTH”

If one grew up in the city of Kolkata in the 1980s and 90s, they would not be unfamiliar with the above graffiti planted on innumerable walls and lampposts. The graffiti and the adamant proponent of this theory are a legend that a generation would remember.

I was reminded of this, by a rather unfortunate turn of events that happened, on a mailing list of much repute. Just this morning, I was speaking with a colleague about how often and unknowingly we are drawn into stressful situations which make us lose focus from the task at hand. After having responded to a mail thread now crossing the 80+ mark, I wanted to step back, summarize and review this entire situation.

It all started when someone, who by his own admission is not a native speaker of the Bangla/Bengali language, wanted to transcribe Sanskrit Shlokas (hymns) in the Bangla script into a digital format and requested modifications to an in-use keymap. To what final end is, however, unclear. This is not an unusual practice, as there are numerous Sanskrit books and texts that have been written in the Bengali script, and this effort can be seen as a natural progression towards digitizing texts of this nature. What stands out is the unusual demand for the addition of a certain character, which is not part of the Bengali script, into a Bengali keymap (much in use) that this gentleman wanted to use to transcribe them. The situation is complicated further because this character is not a random one and belongs to the Assamese script.

The character in question is the Assamese character RA, written as ৰ, with the Unicode code point U+09F0. It is part of the Unicode chart for the Bengali script, which is used to write Bengali, Assamese, and Manipuri (although Meitei is now the primary script for Manipuri). Although now used exclusively for Assamese, this character does have a historical connection with the Bengali script. ৰ was also used as the Bengali character RA before the modern form র (Unicode code point U+09B0) came into practice. At what exact point in time this change happened is somewhat unclear to me, but references to both forms can be found as early as 1778, when Nathaniel Brassey Halhed published A Grammar of the Bengali Language. Dr. Fiona Ross’ extensively researched The Printed Bengali Character: Its Evolution contains excerpts from texts where the ancient form of র i.e. ৰ has been used. However, this is not the main area of concern.
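The two code points mentioned above can be looked up in the Unicode Character Database. The following is only a minimal sketch using Python's standard unicodedata module; the variable names are my own and not part of any keymap or font discussed here.

```python
import unicodedata

# The two RA forms discussed above; both live in the Bengali block.
assamese_ra = "\u09F0"  # ৰ
bengali_ra = "\u09B0"   # র

for ch in (assamese_ra, bengali_ra):
    # Show the code point next to its formal Unicode character name.
    print(f"U+{ord(ch):04X} {ch} {unicodedata.name(ch)}")
```

Both characters carry names beginning with BENGALI LETTER RA, which reflects the shared chart rather than shared usage.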

Given its pan-Indian nature, Sanskrit has been written in numerous regional scripts. I remember that while I was at school, Sanskrit was a mandatory third language of study. The prescribed book for the syllabus used the Devanagari script. On the other hand, the Sanskrit books that I saw in my home were in the Bengali script (some of my ancestors, including my maternal Grandfather, were Priests and Sanskrit teachers who had their own tol). Anyway, I digress here. The main concern is around the two characters ‘BA‘ and ‘VA‘. In Devanagari, ‘BA‘ i.e. ब and ‘VA‘ i.e. व are two very distinct characters with distinct pronunciations. While ‘BA‘ is used for words that need a pronunciation such as बालक (phonetic: baa-lak), ‘VA‘ is used for words such as विद्या (phonetic: weedh-ya). In Bengali, these two variations are respectively known as ‘Borgiyo BA‘ and ‘Antastya BA‘. However, unlike Devanagari they do not have separate characters. So both of them are represented by ব (U+09AC in the Unicode chart). Earlier they held two different positions in the alphabet chart, but even that has been relinquished. The pronunciation varies as per the word, a practice not dissimilar to the behaviour of the letters ‘C‘ and ‘T‘ in English.
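The absence of a separate Bengali VA can also be seen in the code charts. The sketch below assumes the parallel layout of the Devanagari and Bengali blocks (corresponding letters sit 0x80 apart); the helper name and the offset constant are hypothetical and used only for illustration.

```python
import unicodedata

# Assumption: the Bengali block (U+0980...) mirrors the Devanagari block (U+0900...).
DEVANAGARI_TO_BENGALI_OFFSET = 0x0980 - 0x0900

def bengali_counterpart(devanagari_char):
    """Return the name of the code point at the parallel Bengali position,
    or None if nothing is assigned there."""
    candidate = chr(ord(devanagari_char) + DEVANAGARI_TO_BENGALI_OFFSET)
    return unicodedata.name(candidate, None)

print(bengali_counterpart("\u092C"))  # Devanagari BA (ब)
print(bengali_counterpart("\u0935"))  # Devanagari VA (व)
```

Under these assumptions the BA lookup lands on U+09AC, while the position parallel to the Devanagari VA has no Bengali character assigned to it.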

This is where it starts getting muddled. The gentleman in question requested that the Devanagari separation of BA and VA be represented in Bengali as well. The reason stated was that the appropriate pronunciations of the Sanskrit words were not possible without this distinction. So as a “solution” he suggested the use of the Assamese RA glyph in place of the Borgiyo BA sounds, with the Bengali BA to be reserved exclusively for the lesser used Antastya BA i.e. VA sounds. Depicted below as a diagram for ease of reference.

The question of what legacy this link is based on, or how the pronunciations of the two characters were determined, meets a dead end in the historical references of the Bengali script[1].

To support his claims he also produced a set of documents[1][2] which proudly announce themselves as the “New Bengali character set” (নূতন বর্ণপরিচয়/Nutan Barnaparichay) at the top of the pages. The New Bengali character set seems quite clandestine, and no record of it is present in the publications from the Paschimbanga Bangla Academy, Bangla Academy Dhaka or any of the other organisations that are considered significant contributors to the development and regulation of the language. Along with the New character set, there are also scanned images from books where the use of this character variation can be seen. However, the antecedents of these books have not been clearly identified. In one of them, the same word (বজ্র) has been spelt differently in two sentences, which imho adds more confusion to the melee.

On my part, I have also collected some excerpts from Sanskrit content written in Bengali, with particular emphasis on the use of ব. Among them is one from the almanacs (ponjika) which are widely popular amongst householders and priests for everyday reference to religious shlokas and hymns.

The character in the eye of the storm i.e. the Assamese RA and its Bengali counterpart are very special characters. Each of them forms two different conjuncts with the ‘YA’ (U+09AF, which is shared by both the scripts) without changing the sequence of the characters:

র + য = র্য
র + য = র‍্য (uses ZWJ)

ৰ + য = ৰ্য
ৰ + য = ৰ‍্য (uses ZWJ)
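The four conjuncts above can be spelled out as code point sequences. This is a small sketch, assuming the usual encoding in which the virama (hasanta, U+09CD) joins the two letters and the ZWJ (U+200D) is placed right after the RA; the function name is my own.

```python
RA = "\u09B0"            # র BENGALI LETTER RA
ASSAMESE_RA = "\u09F0"   # ৰ
YA = "\u09AF"            # য BENGALI LETTER YA
VIRAMA = "\u09CD"        # ্ BENGALI SIGN VIRAMA (hasanta)
ZWJ = "\u200D"           # ZERO WIDTH JOINER

def conjuncts(ra):
    """Return the two RA + YA conjunct sequences: without and with ZWJ."""
    plain = ra + VIRAMA + YA           # letters keep the same order
    with_zwj = ra + ZWJ + VIRAMA + YA  # same order, ZWJ selects the other form
    return plain, with_zwj

for ra in (RA, ASSAMESE_RA):
    for seq in conjuncts(ra):
        # Print the rendered conjunct followed by its code points.
        print(seq, " ".join(f"U+{ord(c):04X}" for c in seq))
```

In both sequences RA comes before YA; only the presence of the joiner differs, which is why the choice of form cannot be inferred from letter order alone.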

The Bengali character set as we know it today was created by Ishwar Chandra Bidyasagar, in the form of the বর্ণপরিচয়/Barnaparichay written by him. Even before his time, the script had seen modern advancements, mostly to cater to the requirements of the printing industry. His reforms added a finality to this. The বর্ণপরিচয়/Barnaparichay still remains the first book that Bengali children read while learning the alphabet. This legacy is the bedrock of the printed character and, coupled with grammar rules, defines how Bengali has been written and used for the last 160 years. The major reform that happened after his time was the removal of the character ঌ (U+098C) from everyday use. Other than this, the script has remained unchanged. In such a situation, a New Barnaparichay with no antecedents and no endorsements from the governing organisations cannot shake the solid foundations of the language. The way the language is practised allows for some amount of liberty, mostly in terms of spellings and mainly due to the legacy and origins of the words. Some organisations or publication houses prefer to use the conservative spellings while others recommend reforms for ease of use. Some inconsistencies are inevitable, but in most cases the system of use is documented for the reader’s reference. Bengali as a language has seen a turbulent legacy. An entire nation was created from a revolution centered around the language.

During this entire fiasco the inputs from the Bengali speaking crowd (me included) were astutely questioned. Besides the outright violation of the Bengali script, the complications arising out of non-standard internationalized implementations, which were highlighted, were waved off. What is more disappointing is the way the representatives from IndLinux handled the situation. As one of the pioneering organisations in the field of Indic localization, they have guided the rest of the Indic localization groups in later years. With suggestions for implementing the above requests in the Private Use Area of the fonts (which may be a risky proposition if the final content, font and keymap are widely distributed) and providing customized keymaps, they essentially risked undoing critical implementational aspects of the Bengali and Assamese internationalization. Whether or not the claims from the original requestor are validated and sorted, personally I am critically concerned about the advice that was meted out (and may have also been implemented) by refuting the judgement of the Bengali localization teams without adequate vetting.

Note: A similar situation was seen with the Devanagari implementation of Kashmiri. Like the Bengali Unicode chart, the Devanagari chart caters to multiple languages including Hindi, Marathi, Konkani, Maithili, Bodo, Kashmiri and a few others. Not all characters are used for all the languages. While implementing Kashmiri, a few of the essential characters were not present in the Devanagari chart. However, similar looking characters were present in the Gurumukhi chart and were used while writing Kashmiri. This was rectified through discussions with Unicode, and the appropriate code points were allotted in the Devanagari chart for exclusive use in Kashmiri.

March 27, 2012

Translation – dive in!

Posted in planetarium, Work & Technology tagged , at 8:54 pm by runa

The reason I started writing this post is the recent rise in interest in things related to translation and localization. Everywhere one turns there is someone evangelising this revolution from atop a soapbox and gathering people around for quick win localization projects. It may be reasonable to ask if I consider this inundation of localizers an unhappy turn of events. Hardly. After having toiled alone for ages, at times through uncharitable sneers, it is indeed a welcome change. However, I have some grave reservations about how this is being done.

Of late there has been a rising impetus on forming geography based communities around some of the significant (eye-ball grabbing) FOSS projects. With the proliferation of the projects’ user base this is a natural progression in the scheme of things. When communities are based on geographies, one of the first things they tend to find commonality in is their language. Thus, enter localization. So far so good. However, this is where the slightly disruptive butterfly starts to flutter its wings.

The localization projects are also a major entry point for new contributors to be lured into the projects. It has forever been a perception that translation was the easiest way to start contributing to any open-source project. And why not? Everyone seemed to be able to read and comprehend English – the original language used in most components and the same ‘everyone’ also knew how to read and write the language that they were going to translate into. Fair enough, come join. All Hail Crowdsourcing!!

This is where the fluttering starts to get serious. Most of these localization projects were not new discoveries. Depending upon the maturity of their localization sub-projects, there are established norms of translation, review, terminology and validation, including certain methods to groom new translators. Teams are formed around a language to ensure that translations are regularly updated and polished to attain a high degree of consistency and perfection. Conventions evolve and rules are honoured.

Does that make it difficult for new entrants to join? Marginally, yes. But then which other projects do not have this barrier? If it is acceptable for projects to validate and audit code before accepting it, why should localized content be considered an open field for experiments? Especially when, compared to code, the latter is far more difficult to trace and rectify.

The following is an excerpt from an interview with Sue Gardner, Director of the Wikimedia Foundation, where she answers a query about whether new contributors were finding it difficult to work their way around the policies:

We queried her take on this second area, pointing out that all publishers that aim to present high-quality information find they need complex rules, whether explicit or via accepted standards of writing and scholarship. Could she give specific examples of areas where we could simplify policy without sacrificing standards?

Yes, the premise of this question is absolutely correct. The analogy I often use is the newsroom. Anybody who’s curious and reasonably intelligent can be a good journalist, but you do need some orientation and guidance. Just like a newsroom couldn’t invite in 100 random people off the street and expect them to make an immediate high-quality contribution, neither can Wikipedia expect that.

What most of these populist programs tend to miss are the percolations that are felt elsewhere. For languages with a large amount of published localized content that has been filtered through long periods of (mostly) manual validation, experiments on ancillary components introduce inconsistency and, worse, errors. For instance, non-validated translations in add-on components ruin the user-interface of the main component, which in most cases is an extremely prominent application and often part of enterprise level products. These errors can be resolved through the usual bug tracking systems, but how does one chase up volunteers who had turned up for localization sprints and have since moved on?

Crowdsourcing is here to stay. So will crowdsourced contributions. With more flexibility in translation tools, the new age translators do not have to go through the rigorous grooming process that was prevalent until a few years back and has shaped a lot of the veteran translators. They can get their contributions into the main projects without any delay. Often with the blessings of the sponsoring project, who do not have to wait for their translation assets to multiply and their local communities to expand. With some amount of experience both as a translator and as a homemaker, the one thing that I can vouch for is that technical translation is not unlike housework – everyone has an opinion on how easy it is, but you don’t know how many corners you end up cleaning until you are down on your knees doing it.

February 3, 2012

Indic Typing Booster – Bengali

Posted in planetarium, Work & Technology tagged , , , , , , at 5:30 pm by runa

My colleagues Pravin Satpute and Anish Patil have been working for some time on a cool tool called the Indic Typing Booster. The premise for this tool is to aid users new to typing in Indian languages. Using a normal US English keyboard (i.e. the widely available generic keyboard around here), users begin typing a word in a keyboard sequence of their choice, and after a couple of key presses the typing booster prompts the user with a series of words that match the key sequence typed so far.

For instance, if the user wanted to type the word ‘कोमल’ (pronounced as: komal) in a phonetic keyboard sequence that maps क to k and ो to o, they could start by pressing ‘k’ and ‘o’ and lo and behold (no not Baba Yaga, but) a drop down menu opens up with possible words starting with ‘को’. From this list the user may then choose one to complete the word they had intended to type. A list of words from a backend database feeds this menu. Each language gets a database of its own, compiled from available text in that language. Users can add new words to the list as well.
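The idea described above can be sketched in a few lines. This is not the Indic Typing Booster's actual code; the keymap, the word list and the function names below are hypothetical stand-ins, and apart from कोमल the words are only there to make the example work.

```python
# Hypothetical phonetic keymap: just the two keys from the example above.
PHONETIC_KEYMAP = {"k": "क", "o": "ो"}

# Stand-in for the per-language word database.
WORD_LIST = ["कोमल", "कोयल", "कमल", "किताब"]

def transliterate(keys):
    """Map each typed key to its character; unknown keys pass through as-is."""
    return "".join(PHONETIC_KEYMAP.get(k, k) for k in keys)

def suggest(keys, words=WORD_LIST, limit=9):
    """Return up to `limit` words that start with the transliterated prefix."""
    prefix = transliterate(keys)
    return [w for w in words if w.startswith(prefix)][:limit]

def add_word(word, words=WORD_LIST):
    """Let the user extend the list with a new word."""
    if word not in words:
        words.append(word)

print(suggest("ko"))
```

A real implementation would need a much larger word store and some ranking of the candidates, but the prefix match on the transliterated keys is the core of the prompt.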

The typing booster requires that the IBus Input Method is installed in the system. The other necessary packages to get Indic Typing Booster working are:

  • ibus-indic-table
  • <language-name>-typing-booster-<keymap-name> (i.e. for Bengali Probhat you would be looking for the bengali-typing-booster-probhat package)

If you are using Fedora, then all these packages can be easily installed with yum. If you are not, then the necessary information for download and installation is available at the Project Home page: https://fedorahosted.org/indic-typing-booster

Besides erasing the need for looking for appropriate keys while maneuvering through the inherent complications of Indic text, the typing booster could evolve into the much needed solution for Indic typing on tablets and smartphones.

After Marathi, Gujarati and Hindi, the Indic Typing Booster is now available for Bengali (yay!). The Bengali database is by far the biggest store so far, thanks to the hunspell list that was created through an earlier effort of Ankur. Pravin announces the new release here.

This is what it looks like.

So to write কিংকর্ত্যবিমূঢ়, I could either type r/f/ZbimwX or just press 4 to complete it.

Do please give the Indic Typing Booster a go and if you’d like to contribute then head over to the mailing list – indic-typing-booster-devel AT lists.fedorahosted.org or IRC channel – #typing-booster channel (FreeNode).

December 20, 2011

হ য ব র ল – Level up

Posted in planetarium, Uncategorized, Work & Technology tagged , at 12:35 am by runa

A couple of days back the following announcement was made by the Government of India through the PTI:

In a bid to overcome problems posed by difficult Hindi words, Government has asked section officers to use their ” hinglish” replacements for easy understanding and better promotion of the language.

official circular here.

Excuse me while I whoop with joy for a moment here. The reason being, it’s a clear endorsement of something that I have forever followed in Bengali (India) Translations. I have argued, fought and have been occasionally berated for not coming up with innovative Bengali words for the various technical terminology that I have translated. My steady answer has been something to the tune of – ‘don’t fix it, if it ain’t broken’.

At conferences and other places, when I used to interact with people who had suddenly taken an interest in localization, they were often pretty upset that things like ‘files‘, ‘keyboards‘, ‘cut‘, ‘print‘ etc. were simply transliterated in Bengali. (I am sure they did not hold very high opinions about the bunch of Bengali localizers.) So we got suggestions like – “you could consider translating ‘paste’ as ‘লেপন’ “(similar to গোবর লেপা, I suspect), or “you need to write মুদ্রণযন্ত in place of a printer“. There were more bizarre examples, which were more like words constructed with several other words (for things like URL, UTC etc.). I held my ground at that time, and hopefully this announcement has at last put my doubts (well, I did have second thoughts about whether I was being too adamant while “compromising authenticity for practicality“) to rest.

After getting the necessary i18n bits fixed, Bengali localization for desktop applications primarily came about circa 2000. However, computer usage among the Bengali speaking/reading population had been happening for decades before that. By the time the first few desktop applications started to peek through in Bengali, there already were a good many users who had familiarized themselves with the various terms on the desktop. Users were well-familiar with:

  • ‘clicking‘ on ‘buttons‘, or
  • going to a link, or
  • ‘printing‘ a ‘document‘,
  • ‘cutting‘ and ‘pasting‘,
  • ‘pointing‘ with a ‘mouse‘ etc.

Subjecting them to barely relatable or artificially constructed terms would have squeezed in another learning phase. It just did not make sense.

In response, the other question that crept in was – ‘then why do you need to localize at all?‘ It is a justified query. Especially in a place like India, which inherited English from centuries of British rule. However, familiarity with a language is not synonymous with comfort. Language has been a hindrance for many things for ages. Trying to read a language one is not fully comfortable with can be a cumbersome experience. For eg. I can speak and understand Hindi quite well, but lack the fluency to read it. Similarly, there were a good number of people who did not learn English as their primary language of communication[1]. Providing a desktop which people can read faster would have gotten rid of one hurdle that had probably kept away a lot of potential users.

There were also people who knew the terms indirectly, perhaps someone like a clerk in the office who did not handle a computer but regularly needed to collect printouts from the office printer. This group of people could mouth the words but did not read them often, and if the language on the desktop was not the primary language of everyday business, they probably did not even know what the words looked like. When getting them to migrate their work desks to a desktop, it was essential to ensure that the migration was seamless, and we gave prime importance to the following:

  • Familiar terms should not be muddled up, and
  • Readability of the terms is not compromised

Point 1 is also required to ensure that the terminology is not lost in translation when common issues are discussed across geographies and locales. For eg. in institutes of higher education or global business houses. Getting it done by integrating transliterated terminology for highly technical terms that were already in prevalence seemed like the optimum solution. It has not worked badly for Bengali (India) localization so far. We have been able to preserve a high quality of consistency across desktop applications primarily because the core technical terminology never needed to be artificially created, which also allows new translators (already familiar with desktops in most cases) to get started without too much groundwork.

Note: it is not unusual to find people in India speak fluently in 2-3 languages and not always in a pure form of any. Mixing words from several languages while conversing is quite a prevalent practice these days.
