jointbayesianmorphologylearningfordravidianlanguages · malayalam അ എ ഒ ... grammar. although...

Joint Bayesian Morphology learning for Dravidian languages

Arun KumarUniversitat Oberta de

[email protected]

Lluís PadróUniversitat Politècnica de

[email protected]

Antoni OliverUniversitat Oberta de

[email protected]

Abstract

In this paper a methodology for learn-ing the complex agglutinative morphologyof some Indian languages using AdaptorGrammars and morphology rules is pre-sented. Adaptor grammars are a compo-sitional Bayesian framework for grammat-ical inference, where we define a mor-phological grammar for agglutinative lan-guages and morphological boundaries areinferred from a plain text corpus. Oncemorphological segmentations are produce,regular expressions for sandhi rules and or-thography are applied to achieve the finalsegmentation. We test our algorithm in thecase of two complex languages from theDravidian family. The same morphologi-cal model and results are evaluated com-paring to other state-of-the art unsuper-vised morphology learning systems.

1 Introduction

Morphemes are the smallest individual units thatform words. For example, the Malayalam word(മലക�െട, malakaḷuṭe, related to mountains) con-sists of several morphemes ( stem mala, pluralmarker kal, and genitive case marker uṭe). Mor-phological segmentation is one of the most studiedtasks in unsupervised morphology learning (Ham-marström and Borin, 2011). In unsupervised mor-phology learning, the words are segmented intocorresponding morphemes with any supervision,as for example morphological annotations. It pro-vides the simplest form of morphological analy-sis for languages that lack supervised knowledgeor annotation. In agglutinative languages, there isa close connection between suffixes and morpho-syntactic functions and thus, in those languagesthe morphological segmentation may approximatemorphological analysis well enough. Most un-supervised morphological segmentation systems

have been developed and tested on a small set ofEuropean languages (Creutz and Lagus, 2007a),mainly English, Finnish and Turkish, with few ex-ceptions in Semitic languages (Poon et al., 2009).These languages show a variety of morphologi-cal complexities, including inflection, agglutina-tion and compounding. However, when applyingthose systems on other language groups with theirown morphological complexities, we cannot ex-pect the good results demonstrated so far to be au-tomatically ported into those languages. We as-sume that morphological similarities of same lan-guage family enable us to define a general modelthat work across all languages of the family.In this paper we work with a set of Indian lan-

guages that are highly agglutinated, with wordsconsisting of a root and a sequence of possi-bly many morphemes and with each suffix cor-responding to a morpho-syntactic function suchas case, number, aspect, mood or gender. In ad-dition to that, they are highly and productivelycompounding, allowing the formation of very longwords incorporating several concepts. Thus, themorphological segmentation in those languagesmay partially look like a word segmentation task,which attempts to split the words in a sentence.Dravidian languages (Steever, 2003) are a group

of Indian languages that shows extensive use ofmorpho-phonological changes in word or mor-pheme boundaries during concatenation, a processcalled sandhi. This process also occurs in Euro-pean languages (Andersen, 1986), but it becomesmore important in the case of Dravidian languagesas they use alpha-syllabic writing systems. (Taylorand Olson, 1995).Recently, interest has shifted into semi-

supervised morphological segmentation that en-ables to bias the model towards a certain languageby using a small amount of annotated training data.We also adopt semi-supervised learning to moreeffectively deal with the complex orthography

of the Dravidian languages. We use the AdaptorGrammars framework (Johnson et al., 2007a) forimplementing our segmentation models that hasbeen shown to be effective for semi-supervisedmorphological segmentation learning (Sirts andGoldwater, 2013) and it provides a computationalplatform for building Bayesian non-parametricmodels and its inference procedure. We learn thesegmentation patterns from transliterated inputand convert the segmented transliterations into theorthographic representation by applying a set ofregular expressions created from morphologicaland orthographic rules to deal with sandhi.We test our system on twomajor languages from

the Dravidian family— Malayalam and Kannada.These languages, regardless of their large numberof speakers, can be considered resource-scarce, forwhich not much annotated data available for build-ing morphological analyzer. We build a modelthat makes use of languages morphological and or-thographic similarities. In Section 2, we list themmain morphological and orthographic similaritiesof these languages.The structure of this paper is as follows. In

Section 2 we describe more thoroughly the mor-phological and orthographic challenges presentedin Dravidian languages. Section 3 describes theAdaptor Grammars framework. In section 4, wedescribe the morphological segmentation systemfor the Dravidian languages. Experimental setupis specified in Section 5, followed by the resultsand analysis in section 6 and conclusions in Sec-tion 7.

2 Morphology of Dravidian languages

In this study we focus on Kannada andMalayalam,which are two major languages in the south Dra-vidian group. These languages are inflected andhighly agglutinative, which make them morpho-logically complex. The writing systems of theselanguages are alpha-syllabic, i.e. each symbol rep-resents a syllable. In this section we discuss mor-phological and orthographic similarities of theselanguages in detail.

2.1 Orthography

Kannada and Malayalam follow an alpha-syllabicwriting system in which individual symbols aresyllables. In both languages symbols are calledakṣara. The atomic symbols are classifiedinto two main categories (svaram, vowels) and

(vyaṅṅajnam, consonants). Both languages havefourteen vowels, (including a, ā, i, ī, u,

ū, e, ē, ai, o, ō, au, am)1, where am isan anusvāram, which means nasalized vowel. Ta-ble 1 shows some examples of their orthographicrepresentation. These vowels are in atomic formbut when they are combined with consonant sym-bols, the vowel symbols change to ligatures, re-sulting in consonant ligatures (see examples in Ta-ble 2).

Table 1: VowelsISO Transliteration a i o uMalayalam അ എ ഒ ഉKannada ಅ ಇ ಈ ಉ

Table 2: Consonant ligatureConsonant Vowel Ligature

ISO Transliteration m ī mī

Malayalam മ ഈ മീKannada ಮ ಈ Ģೕ

Orthography of both languages supports a largenumber of compound characters resulting from thecombination of two consonants symbols, as thoseshown in Table 3.

Table 3: Compound charactersISO Transliteration cca kka

Malayalam � �Kannada ಚ© ಕ¤

The orthographic systems contain characters fornumerals and Arabic numerals are also present inthe system. See examples in table 4

2.2 Sandhi changes

Sandhi is a morpho-phonemic change happeningin the morpheme or word boundaries at the timeof concatenation. Both Kannada and Malayalamhave three major kinds of sandhi formations: dele-tion, insertion and assimilation. In the case of dele-tion sandhi, when two syllables are joined togetherone of the syllable is deleted from the resultingcombination, while insertion sandhi adds one syl-lable when two syllables are joined together. Thesandhi formations found in Sanskrit are also foundin these languages as these languages loan large

1These symbols are according to ISO romanization stan-dard: ISO-15919

Table 4: NumeralsArabic Numerals 1 2

Malayalam ൧ ൨Kannada ೧ ೨

number of roots from Sanskrit. There are languagespecific sandhi, such as visarga sandhi. Visargasandhi is commonly found in the case of loanedwords from Sanskrit. Visarga is an allophoneof phonemes /r/ and /s/ at the end of the utter-ance. Kannada orthography keeps the symbol forvisarga butMalayalam orthography uses columnsymbol for representing it. As languages followalpha-syllabic orthography, these sandhi changeswill change the orthography of the resulting word,examples are listed in Table 5. In the examples,when theMalayalam stem mala (mountain) joinedto āṇ (is), a new syllable y introduced to resultingword malayāṇ. Similarly in the case of Kannadastem magu, when joined with annu, the resultingword maguyannu also includes an extra syllabley. These similarities between languages enable usto define a generic finite-state traducer for ortho-graphic changes that are caused by phonologicalchanges. An example of created rule for additionof syllabley is that when (a) is preceded with (u)and (a) a new syllable is y is added with resultingword.

Table 5: Sandhi ChangesLanguage Insertion sandhiMalayalam mala+āṇ→ malayāṇ

Kannada Magu+annu→ maguyannu

2.3 Morphology

Due to those orthographic properties of Dravid-ian languages, morpheme boundaries are markedat the syllabic level. However, during concate-nation, phonological changes sandhi may occur,so that the resulting word has a different orthog-raphy than the individual segments concatenatedtogether. That means the surface form may bedifferent from the lexical form. For example, inMalayalam the surface form of the word ക�ി�(kaṇṭilla) corresponds to the lexical form ക�+ ഇ� (kaṇṭu + illa). The changes in orthog-raphy in surface form when lexical units are com-bined are present in this example. Kannada alsoexhibits similar properties.

Malayalam and Kannada use case markers fornominative, accusative, genitive, dative, instru-mental, locative, and sociative (Agesthialingomand Gowda, 1976). In both languages, nouns in-flect with gender, number, and case. Gender ismarked for masculine and feminine, while neutercorresponds to the absence of gender marker.There are two categories of number markers: sin-gular and plural. Both languages can use plu-ral markers for showing respect. For example,the Kannada word (�ಾಯು vāyu wind) is inflectedwith masculine gender marker. Similarly, theMalayalam word (വാ� vāyu wind) has a mascu-line gender marker. In the case of verb morphol-ogy, both languages inflect with tenses, mood, andaspect. Where mood can follow arative, subjunc-tive, conditional, imperative, presumptive , abil-itative, and habitual. Aspect markers can followthree categories, such as simple, progressive, pur-posive. For example, the Kannada sentence (�ಾನುಬಂದು Nānu bandu I come), the verb bandu in-flected with present tense marker u. Similarly theMalayalam sentence (താരം വ� tāram vannu Starcomes), the verb vannu inflected with u, which isthe present tense marker.Compounding acts as another challenge in Dra-

vidian languages where words can have a recursivecompound structure. There can be compounds em-bedded in a compound word, which itself can be-come another compound (Mohanan, 1986). For in-stance, in Malayalam (jātimātaviduveṣaṅṅaḷ,hatred of caste and religion) consist of first com-pound (jāti + māta, caste and religion), joinedwith other compound (viduveṣam, hatred) andplural inflection ṅṅaḷ.

3 Adaptor Grammars

Adaptor Grammars (AG) (Johnson et al., 2007a)is a non-parametric Bayesian framework for per-forming grammatical inference over parse trees.AG has two components—a PCFG (ProbabilisticContext Free Grammar) and an adaptor function.The PCFG specifies the grammar rules used togenerate the data. The adaptor function transformsthe probabilities of the generated parse trees so thatthe probabilities of the adapted parse trees may besubstantially larger than under the conditionally in-dependent PCFG model. Various Bayesian non-parametric models can be used as Adaptor func-tion, such as HDP (Teh et al., 2006). For instance,the Pitman-Yor Adaptor (Johnson et al., 2007a),

which is also used in this work, transforms theprobability of an adapted parse tree in such a waythat it is proportional to the number of times thistree has been observed elsewhere in the data. Weprovide an informal description of adaptor gram-mar here. An adaptor grammar consists of ter-minals V and non-terminals N , (including a startsymbol, S), and initial rule set R with probabil-ity p, like a Probabilistic Context Free Grammar(PCFG). A non-terminal A ∈ N , has got a vectorof concentration parametersα, whereαA >0. Thenwe say non-terminal A is adapted. If αA = 0 thenA is an unadapted non-terminal. A non-terminalA, which is unadpted expand as in PCFG but anadapted non-terminal A can expand in two ways:

1. A can expand to a subtree t with probabilitynt/nA + αA, where nt is the number of timesA has expanded to t before and

2. Expand as in PCFG considering the probabil-ity propositional to concentration parameterαA

Inference on this model can achieved using a sam-pling procedure . The formal definition of AGs canbe found in (Johnson et al., 2007a), details of theinference procedures are described in (Johnson etal., 2007b) and (Johnson and Goldwater, 2009).

4 AGs for morphological segmentationfor Dravidian languages

Dravidian languages are highly agglutinative,which means that a stem can be attached a se-quence of several suffixes and several words canbe concatenated together to form compounds. Thesegmentation model has to take these languageproperties into account.We can define a grammar reflecting the agglu-

tinative structure of language similar to the com-pounding grammar of (Sirts and Goldwater, 2013),excluding prefixes:

Word→ Compound+

Compound→ Stem Suffix∗

Stem→ SubMorphs+

Suffix→ SubMorphs+

(1)

Segmentation of long agglutinated sentences isthe aim of above described grammar, where weconsider that words are composed of words2 and

2The word ”compound” is used in our representation forwords

words can be composed of Stem and Suffix, whereStem and Suffix are adapted non terminals, whichare ”adapted” with Pitman-Yor Process (Pitmanand Yor, 1997). Both Stem and Suffix can gen-erated from drawing of Pitman-Yor process or byfollowing PCFG rule. If these non-terminals ex-pand according to PCFG rule, it expand to Sub-morphs, which is an intermediate levels added be-fore terminals, For more details refer (Sirts andGoldwater, 2013)

The Submorphs can be defined in the followingway

Submorphs→ Submorph

Submorphs→ Submorph Submorphs

Submorphs→ Chars

Chars→ Char

Chars→ CharChars

(2)

The Submorphs can be composed of singlemorph or Submorhs, which are combinations ofChar. In our case Char is our internal represen-tation for alpha-syllabic characters. The abovegrammar can generate various parse trees as a weput a Pitman-Yor prior on component. It is goingto produce most probable morphological segmen-tation based on the prior probabilities. For moredetails of this procedure, refer (Johnson and Gold-water, 2009)

This grammar enables representing long agglu-tinated phrases that are common in Dravidian lan-guages. For instance, an agglutinated Malayalamword phrase sansthānaṅṅaḷileānnāṇ with thecorrect morphological segmentation sansth +ānaṅṅaḷileā + nnāṇ can be represented using thegrammar.

Although this grammar uses the knowledgeabout the agglutinative nature of the language,it is otherwise knowledge-free because it doesn’tmodel the specific morphotactic regularities of theparticular language. Next, we experiment withgrammars that are specifically tailored for Dravid-ian languages and express the specific morphotac-tic patterns found in those languages. We lookat the regular morphological templates describedin linguistic textbooks (Krishnamurti, 1976) and(Steever, 2003) rather than generating just a se-quence of generic Suffixes.

0

1 2 3

5 6 7

4

m:m

m: m

a : a h : h

i: i

n : n r : r

a:a

y: i+a=ya

Figure 1: Example of sandhi rule in FST

4.1 Dealing with Sandhi

As explained above, the words in Dravidian lan-guages often undergo phonological changes whenmorphemes or compound parts are concatenatedtogether. Thus, in order to correctly express thesegmented words in script, it is necessary to modelthose changes properly.In this work we deal with sandhi during a post-

processing step where we apply to the segmentedwords a set of regular expression rules that de-scribe the valid sandhi rules. Our approach is sim-ilar to (Vempaty and Nagalla, 2011) where rulesfor orthographic changes are created. However,we use FST rules at the syllable level. Our methodworks with a general phonetic notation, which isthe same for both languages. For example, theMalayalam word (marannal , trees) is combinationof (maram tree + nnal plural marker). We create acontext sensitive rule for the orthographic change,which look like V → V m”+”||nnal. In the aboveV is set of all syllables in the languages. The samerule also stands for Kannada orthographic change.Similarly we create rule for orthographic changesdue to sandhi. One example of finite-state trans-ducer rule to handle addition sandhi is given in fig-ure 1. It handles the ya sandhi happens changeduring the insertion sandhi. We have 62 rules forMalayalam and 34 rules for Kannada for handlingsandhi changes. The statistics of the data is shownin 5.

5 Data and Experiments

We conduct our experiments on word lists ex-tracted fromWikipedia and newspaper’s websites.The statistics of the data sets are given in Table 6.Word list consist of 30 million tokens of Kannadaand 40 million tokens of Malayalam. The data setconsist of named entities, proper names and abbre-

Kannada Malayalam

Token frequency 30M 40MTypes 1M 1MLabeled 10k 10kRE Rules 62 34

Table 6: Statistics of the data sets.

viations. We also have 10k morphologically hand-segmented words, which act as our gold standardfile.In order to deal with complex orthographies

of Kannada and Malayalam, we have created ainternal representation, which is unique for bothlanguages. The conversion was done in fol-lowing way: the Malayalam word (അേതസമയം ,atēsamayam) converted in to a t h e s a m y am. During this process complex ligatures are con-verted into corresponding extended ASCII formatand put spaces between the characters. Similarlya Kannada word (ಮಧ½ದ, madhyada) converted tom a d y a d a. This representation allow us to usethe same grammar for both languages. The con-version of orthographic form to internal represen-tation is as follows. In the first step we have con-verted language’s scripts to corresponding ISO ro-manization. This representation helps in gettingunique values for various ligatures and compoundcharacters. Once the script is converted to ISO ro-manized form, we convert it into Extended ASCIIform with unique values for each characters. Aspart of our experiment, we converted all words inthe lists and morphological segmentations to ourinternal representation. For training the AG mod-els3, we use the scenarios proposed by (Sirts andGoldwater, 2013) we train the models using 10K,20K 40k, 50K, 80K most frequent word types ineach language with same grammar and segmentthe test data inductively with a parser using theAG posterior grammar as the trained model. Werun five independent experiments with each settingfor 1000 iterations after which we collect a singlesample for getting the AG posterior grammar.Using the trained models we segment our gold

standard. Once the AG posterior grammar pro-duce the morphological segmentation in internalformwe converted internal representation into cor-responding orthographic form for evaluation of re-sult. The process of converting the internal repre-

3Software available at http://web.science.mq.edu.au/~mjohnson/

sentation to orthographic form is as follows. Wetake the internal representation of a word one byone and we apply finite-state rule that takes careof sandhi and then it convert back to orthography.The number of finite-state rules, is listed as RErules in the Table 5.

6 Evaluation, Error analysis andDiscussion

The evaluation is based on how well the methodspredicts the morpheme boundaries in orthographicform and calculates precision, recall and F- score.We used Python suite provided in the morpho-challenge website4 for evaluation purposes Wealso train Morfessor baseline, Morfessor-CAPand Undivide, with 80K word types. We com-pare our results with several baselines that havebeen previously successfully used for aggluti-nated languages: Finnish and Turkish. For unsu-pervised baselines we use Morfessor Categories-MAP (Creutz and Lagus, 2007b) and Undivide(Dasgupta and Ng, 2007). We train MorfessorCategories-MAPwith the 80Kmost frequent wordtypes and produce a model. Using this model thegold standard file is segmented and the results arecompared with the manual segmentations. Thesame process is carried out in the case of Morfes-sor baseline. In the case of Undivide, we apply thesystem on the gold standard file and get the seg-mentation. We use Undivide software because itperformed very well in the case of highly inflectedIndian language Bengali. The results are evalu-ated by computing the segment boundary F1 score(F1) as is standard for morphological segmentationtask.The result achieved is presented in the table 7.

In the table (P) stands for Precision and (R) standsfor Recall and (F) stands for F-score.On the manual analysis of the predicted word

segmentations by our system and other baselines,we note the following:

• Our system was able to identify the sandhichanges and orthographic changes due tosandhi but other systems were unable to dothat because of lack knowledge of orthogra-phy and sandhi changes.

• In the case of compound characters, Mor-fessor, Morfessor- MAP and Undivide seg-mented it into two constituent character,

4http://research.ics.aalto.fi/events/morphochallenge/

which is not required. For example, theMalayalam character (�, nka) to (n) and (ka).

• All algorithms have divided compoundswords.

We did not evaluated the result produced by theAdaptor Grammar individually as we need the out-put in language’s script.

7 Conclusion and future research

We have presented a semi-supervised morphol-ogy learning technique that uses statistical mea-sures and linguistic rules. The result of the pro-posed method outperforms other state-of-art un-supervised morphology learning techniques. Themajor contribution of this paper is the use of samemodel of morphology for segmenting two mor-phologically complex languages and the sandhichanges in both the languages are handled usinga single finite-state transducer. In essence, we canconsider it as a hybrid system, which make use ofstatistical information and linguistic rules togetherto produce better results. The experiments showthat morphology of two complex languages can belearned jointly. Other important aspect of these ex-periments is that we tested Adaptor Grammars inthe case of complex Indian languages and showedthat it can be used in languages with complex mor-phology and orthography. The major aim of thestudy was to show a general model of morphol-ogy, which could be used to learn morphology oftwo languages. As further research, we intend totrain the system with larger number of tokens andevaluate the performance in the presence of largeamount of data. As we also noted an improvementin the performance when the number of word typeincreases.

Acknowledgments

We thank the anonymous reviewers for their care-ful reading of our manuscript and their many in-sightfulcomments and suggestions.

References

Shanmugam Agesthialingom and K KushalappaGowda. 1976. Dravidian case system. Number 39.Annamalai University.

Henning Andersen. 1986. Sandhi phenomena in thelanguages of Europe, volume 33. Walter de Gruyter.

Method Kannada Malayalam

P R F P R F

Undivide 40.98 47.17 43.86 64.08 27.12 38.22Morfessor-base 67.92 59.02 45.63 38.21 41.59 48.54Morfessor-MAP 70.32 53.77 62.1 62.64 47.11 53.77Adaptor Grammar 73.63 59.82 66.01 65.66 54.32 59.45

Table 7: Results for several systems

Mathias Creutz and Krista Lagus. 2007a. Unsuper-vised models for morpheme segmentation and mor-phology learning. ACM Transactions on Speech andLanguage Processing (TSLP), 4(1):3.

Mathias Creutz and Krista Lagus. 2007b. Unsuper-vised models for morpheme segmentation and mor-phology learning. ACM Transactions on Speech andLanguage Processing, 4(1):3:1–3:34.

Sajib Dasgupta and Vincent Ng. 2007. High-performance, language-independent morphologicalsegmentation. In HLT-NAACL, pages 155–163.

Harald Hammarström and Lars Borin. 2011. Unsuper-vised learning of morphology. Computational Lin-guistics, 37(2):309–350.

Mark Johnson and Sharon Goldwater. 2009. Improv-ing nonparametric Bayesian inference: Experimentson unsupervised word segmentation with AdaptorGrammars. In naacl09, pages 317–325.

Mark Johnson, Thomas L. Griffiths, and Sharon Gold-water. 2007a. Adaptor Grammars: a framework forspecifying compositional nonparametric Bayesianmodels. In Advances in Neural Information Pro-cessing Systems 19, pages 641–648.

Mark Johnson, Thomas L. Griffiths, and Sharon Gold-water. 2007b. Bayesian inference for PCFGs viaMarkov chain Monte Carlo. In HLT-NAACL, pages139–146.

Bhadriraju Krishnamurti. 1976. Comparative dravid-ian studies. Current Trends in Linguistics: Index,14:309.

Karuvannur P Mohanan. 1986. The theory of lexicalphonology: Studies in natural language and linguis-tic theory. Dordrecht: D. Reidel.

Jim Pitman and Marc Yor. 1997. The two-parameterpoisson-dirichlet distribution derived from a stablesubordinator. The Annals of Probability, pages 855–900.

Hoifung Poon, Colin Cherry, and Kristina Toutanova.2009. Unsupervised morphological segmentationwith log-linear models. In Proceedings of HumanLanguage Technologies: The 2009 Annual Confer-ence of the North American Chapter of the Associa-tion for Computational Linguistics, pages 209–217.Association for Computational Linguistics.

Kairit Sirts and Sharon Goldwater. 2013. Minimally-supervised morphological segmentation using adap-tor grammars. Transactions of the Association forComputational Linguistics, 1:255–266.

Sanford B Steever. 2003. The Dravidian Languages.Routledge.

Insup Taylor and David R Olson. 1995. Scripts andliteracy: Reading and learning to read alphabets,syllabaries, and characters, volume 7. Springer Sci-ence & Business Media.

Yee Whye Teh, Michael I Jordan, Matthew J Beal, andDavid M Blei. 2006. Hierarchical dirichlet pro-cesses. Journal of the american statistical associ-ation, 101(476).

Phani Chaitanya Vempaty and Satish Chandra PrasadNagalla. 2011. Automatic sandhi spliting methodfor telugu, an indian language. Procedia-Social andBehavioral Sciences, 27:218–225.

jointbayesianmorphologylearningfordravidianlanguages · malayalam അ എ ഒ ... grammar. although...

Documents