While writing text in many Indian languages, we encounter composite characters made up of various combinations of more than one consonant and/or dependent vowels. Generally, these are written as:
1. Consonant + Joiner + Consonant (+ Dependent Vowel Sign)
2. Consonant + Dependent Vowel Sign (which determines the vowel sound used to pronounce the consonant)
However, there are exceptions where a straightforward implementation of the writing rules cannot be used for text input in an i18n-ized application. An example is the curious case of two letters: র (aka RA, Unicode: U+09B0) and য (aka YA, Unicode: U+09AF). These two consonants allow two different composite characters to be written, with the consonants in the same order of usage.
Sequence 1: র্য = To write words like আর্য (pronounced as ‘Ar-j-ya’; the ‘j’ is an exception in pronunciation practised in Bengali)
Sequence 2: র্য = To write words like র্যান্ডম (i.e. the transliterated version of the word ‘random’, which is pronounced as ‘rya-n-dom’ and hence has to be transliterated appropriately)
In both of the above cases, র and য need to combine in the same order. Hence the simple method of writing them as র + joiner + য cannot serve both cases. Because it occurs more frequently in Bengali words, this plain combination has been assigned to Sequence 1. For Sequence 2, an additional character, ZWNJ (U+200C), had to be used. However, since Unicode 5.0 this has changed: ZWJ (U+200D) is to be used instead of ZWNJ to write Sequence 2.
“…Unicode Standard adopts the convention of placing the character U+200D ZWJ immediately after the ra to obtain the ra-yaphaala…”
– from the Unicode 5.0 book, pg. 316 (afaik the online version is not available)
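To make the code points concrete, here is a minimal Python sketch of the two sequences. The constant and function names are my own, for illustration only. The Sequence 2 ordering follows the quote above (ZWJ immediately after the ra); the position of ZWNJ in the older form is an assumption on my part, mirroring the ZWJ position.

```python
RA = "\u09B0"       # BENGALI LETTER RA
YA = "\u09AF"       # BENGALI LETTER YA
HASANTA = "\u09CD"  # BENGALI SIGN VIRAMA, the "joiner" in the rules above
ZWNJ = "\u200C"     # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"      # ZERO WIDTH JOINER

# Sequence 1: plain RA + joiner + YA, as in আর্য
SEQUENCE_1 = RA + HASANTA + YA

# Sequence 2 (Unicode 5.0 onwards): ZWJ immediately after RA, as in র‍্যান্ডম
SEQUENCE_2 = RA + ZWJ + HASANTA + YA

# Older form of Sequence 2 (assumed ordering: ZWNJ in the same slot as ZWJ)
SEQUENCE_2_LEGACY = RA + ZWNJ + HASANTA + YA


def code_points(text: str) -> str:
    """Return the code points of text in U+XXXX notation."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


def modernize_sequence_2(text: str) -> str:
    """Rewrite the older ZWNJ form of Sequence 2 to the ZWJ form."""
    return text.replace(SEQUENCE_2_LEGACY, SEQUENCE_2)


if __name__ == "__main__":
    print(code_points(SEQUENCE_1))
    print(code_points(SEQUENCE_2))
    print(code_points(modernize_sequence_2(SEQUENCE_2_LEGACY)))
```

Something like `modernize_sequence_2` could be run over older documents, but only the exact RA + ZWNJ + joiner + YA run is touched; ZWNJ used elsewhere is left alone.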
The next challenge was to ensure that this sequence was rendered correctly when used in a document. While it was displayed correctly with Pango, ICU and Uniscribe, rendering in Qt was badly broken [bug links: KDE Bugs, Qt, Fedora/Red Hat Bugzilla]. After prolonged contemplation, Pravin managed to push a patch to Harfbuzz that fixes this issue and will also make it into Qt. This takes care of the rendering problem.
The review discussion for this patch (which is also expected to resolve a few other issues) is happening here. However, the delay in updating the badly outdated entry in the Unicode FAQ led to a lot of confusion about whether the usage of U+200C had indeed been discontinued in favour of U+200D. This needs prompt action from whoever maintains that FAQ. (Sayamindu had also mentioned it in his blog earlier.)
Two consonants can be used to write two different composite characters, distinguished by the code point sequence used.
The other major issue under discussion in the same review is whether to allow a sequence of multiple split dependent vowel signs to be entered as a valid equivalent of a single dependent vowel sign.
E.g. ক (U+0995) + ে (U+09C7) + া (U+09BE) to be allowed as an alternative input sequence for ক (U+0995) + ো (U+09CB)
The Devanagari equivalent would be:
क (U+0915) +े (U+0947) +ा (U+093E) to be allowed as an alternative input sequence for क (U+0915) +ो (U+094B)
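One way to see how the Unicode data itself treats these two pairs is to run them through NFC normalization with Python's standard `unicodedata` module. This is only a sketch to illustrate the point; the variable and function names are mine. The Bengali sign U+09CB carries a canonical decomposition into U+09C7 U+09BE, so NFC folds the split form into the single sign, whereas the Devanagari sign U+094B has no such decomposition and the split form stays as typed.

```python
import unicodedata

# Bengali: KA + E + AA versus KA + O
bengali_split = "\u0995\u09C7\u09BE"
bengali_single = "\u0995\u09CB"

# Devanagari: KA + E + AA versus KA + O
devanagari_split = "\u0915\u0947\u093E"
devanagari_single = "\u0915\u094B"


def same_after_nfc(a: str, b: str) -> bool:
    """True if the two strings are identical once normalized to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    # Canonical decomposition recorded for each single vowel sign
    print(repr(unicodedata.decomposition("\u09CB")))
    print(repr(unicodedata.decomposition("\u094B")))
    print(same_after_nfc(bengali_split, bengali_single))
    print(same_after_nfc(devanagari_split, devanagari_single))
```

So an application that stores NFC text would already treat the two Bengali sequences as the same thing, while for Devanagari any equivalence would have to come from an explicit rule in the input method or the shaping engine.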
In general practice, when a dependent vowel is written after a consonant it completes the composite character. Multiple dependent vowels are not allowed to be written for one single consonant. While the pictorial representation in the above example may look similar, in reality the split vowel sequence may lead to incorrect rendering across applications (and in future for URLs as well) if the code points are stored as such. In applications using Pango, the second vowel input is displayed to the user as an unattached vowel sign with a dotted circle. This automatically warns the user about an invalid sequence entry.
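A check along the same lines can be sketched in a few lines of Python. To be clear, this is not how Pango decides where to draw the dotted circle; it is a hypothetical validator of my own that simply flags two or more Bengali dependent vowel signs in a row, and the list of signs covered is my own selection from the Bengali block.

```python
import re

# Bengali dependent vowel signs: AA..VOCALIC RR, E, AI, O, AU,
# VOCALIC L and VOCALIC LL (selection assumed for this sketch)
BENGALI_VOWEL_SIGN = "[\u09BE-\u09C4\u09C7\u09C8\u09CB\u09CC\u09E2\u09E3]"

# Two or more dependent vowel signs in a row
STACKED_VOWEL_SIGNS = re.compile(f"{BENGALI_VOWEL_SIGN}{{2,}}")


def find_stacked_vowel_signs(text: str) -> list[tuple[int, str]]:
    """Return (offset, run) for every run of consecutive dependent vowel signs."""
    return [(m.start(), m.group()) for m in STACKED_VOWEL_SIGNS.finditer(text)]


if __name__ == "__main__":
    split_form = "\u0995\u09C7\u09BE"   # KA + E + AA
    single_form = "\u0995\u09CB"        # KA + O
    print(find_stacked_vowel_signs(split_form))
    print(find_stacked_vowel_signs(single_form))
```

An input method or a text field could use a check like this to warn about the split sequence before it is stored, instead of relying on the renderer to show the problem.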
Since Qt (and, it looks like, Uniscribe too) follows this practice, perhaps a specification is floating around somewhere about how the conversion and storage of such input sequences is handled. Any pointers to this would be very helpful. At present I am keeping an eye on the review discussion, and hopefully the issues will be resolved so that a uniform standard holds across all platforms.