cs460/626 : natural language processing/speech, nlp and the web (lecture 36–syllabification and...

CS460/626 : Natural Language Processing/Speech, NLP and the Web

(Lecture 36–syllabification and transliteration)

Pushpak BhattacharyyaCSE Dept., IIT Bombay

11th April, 2011

Access from one language to another

Language L1

Translation

Language L2

TransliterationGandhiji गां��धी�जी�Boy

लड़का�

Cross Lingual Query

Query Translati

on

Search Engine

English Docume

nt

Translate Docume

nt

गां��धी� का� जीन्मस्था�न

Birthplace of Gandhi

Output

Sound based transliteration Transliteration

Script Script

Transliteration using pronunciation Convert script to sound Convert sound to script

Similar to interlingua based MT

MeaningSentence in source Language

Sentence in target Language

(गां��धी�जी�)(Gandhiji)

Two approaches to transliteration

Syllabification

Rule Based

Machine Learning Based

Word Syllabified

Apple ap ple

Fundamental fun da men tal

Use GIZA++ Moses on parallel corpus

Phoneme-based approach

Word inSource language

Pronunciationin

Source language

Word inTarget language

PronunciationIn

target language

P( ps | ws)

P ( pt | ps )

P ( wt | pt )

Note: Phoneme is the smallest linguistically distinctive unit of sound.

P(wt)

Wt* = argmax (P (wt). P (wt | pt) . P (pt | ps) . P (ps | ws) )

Basics of syllables

“Syllable is a unit of spoken language consisting of a single uninterrupted sound formed generally by a Vowel and preceded or followed by one or more consonants.”

Vowels are the heart of a syllable (Most Sonorous Element) (svayam raajate iti svaraH)

Consonants act as sounds attached to vowels.

Syllable structure

A syllable consists of 3 major parts:- Onset (C) Nucleus (V) Coda (C)

Vowels sit in the Nucleus of a syllable Consonants may get attached as

Onset or Coda. Basic structure - CV

Possible syllable structures The Nucleus is

always present Onset and Coda

may be absent Possible

structures V CV VC CVC

Introduction to sonority theory

“The Sonority of a sound is its loudness relative to other sounds with the same length, stress and speech.”

Some sounds are more sonorous Words in a language can be divided into

syllables Sonority theory distinguishes syllables

on the basis of sounds.

Sonority hierarchy Defined on the basis of amount of

sound associated The sonority hierarchy is as follows:-

Vowels (a, e, i, o, u) Liquids (y, r, l, v) Nasals (n, m) Fricatives (s, z, f,…..sh, th etc.) Affricates (ch, j) Stops (b, d, g, p, t, k)

Sonority scale Obstruents can

be further classified into:- Fricatives Affricates Stops

Sonority theory & syllables

“A Syllable is a cluster of sonority, defined by a sonority peak acting as a structural magnet to the surrounding lower sonority elements.”

Represented as waves of sonority or Sonority Profile of that syllable

Nucleus

Onset Coda

Sonority sequencing principle

“The Sonority Profile of a syllable must rise until its Peak(Nucleus), and then fall.”

Peak (Nucleus)

Onset Coda

Maximal onset principle

“The Intervocalic consonants are maximally assigned to the Onsets of syllables in conformity with Universal and Language-Specific Conditions.”

Determines underlying syllable division

Example DIPLOMA

DIP LO MA & DI PLO MA

Syllable Structure: a more detailed look

Count of no. of syllables in a word is roughly/intuitively the no. of vocalic segments in a word.

Thus, presence of a vowel is an obligatory element in the structure of a syllable. This vowel is called “nucleus”.

Basic Configuration: (C)V(C). Part of syllable preceding the nucleus is called the

onset. Elements coming after the nucleus are called the

coda. Nucleus and coda together are referred to as the

rhyme.

S ≡ Syllable, O ≡ OnsetR ≡ Rhyme, N ≡ NucleusCo ≡ Coda

Syllable Structure: Examples

‘word’

‘sprint’

Syllable Structure: Examples

‘may’

‘opt’

‘air’

No Coda.

No Onset.

No Coda, No Onset.

Syllable Structure Open Syllable: ends in vowel Closed syllable: ends in consonant or consonant

cluster

Light Syllable: A syllable which is open and ends in a short vowel

General Description – CV. Example, ‘air’.

Heavy Syllable: Closed syllables or syllables ending in diphthong

Example: ‘opt’ Example, ‘may’

Syllabification: Determining Syllable Boundaries

Given a string of syllables (word), what is the coda of one and the onset of another?

In a sequence such as VCV, where V is any vowel and C is any consonant, is the medial C the coda of the first syllable (VC.V) or the onset of the second syllable (V.CV)?

To determine the correct groupings, there are some rules, two of them being the most important and significant:

Maximal Onset Principle, Sonority Hierarchy

(This and statistical syllabification and application to Transliteration: DDP work

by Ankit Aggrwal, 2009)

Rule Based Syllabification

Phonotactics

Guides construction of onsets and codas

Restricts combination of phonemes.

In general, rules operate around the sonority hierarchy. E.g., Fricative /s/ is lower on the sonority hierarchy than the lateral /l/, so the

combination /sl/ is permitted in onsets and /ls/ is permitted in codas. Opposite is not allowed.

Thus, ‘slips’ and ‘pulse’ are possible English words. ‘lsips’ and ‘pusl’ are not possible.

Constraints on Onsets1

One-consonant: Only /ŋ/ cannot occur in syllable-initial position. Two-consonant: We refer to the scale of sonority.

Sequence ‘rn’ is ruled out since there is a decrease of sonority. Minimal Sonority Distance: Distance in sonority between the first and the

second element in the onset must be of at least 2 degrees. Thus, only a limited no. of possible two-consonant clusters.

Three-consonant: Restricted to licensed two-consonant onsets preceded by /s/. Also, /s/ can only be followed by a voiceless sound. Therefore, only /spl/, /spr/, /str/, /skr/, /spj/, /stj/, /skj/, /skw/,

/skl/, /smj/ will be allowed. (splinter, spray, strong etc.) While /sbl/, /sbr/, /sdr/, /sgr/, /sθr/ will be ruled out.

Constraints on Onsets2

Possible 2-consonat clusters in an Onset

Constraints on Coda1

Constraints on Coda2

Other Constraints

Nucleus: The following can occur as nucleus: All vowel sounds (monophthongs as well as diphthongs).

Syllabic: Both the onset and the coda are optional (as seen previously). /j/ at the end of an onset (/pj/, /bj/, /tj/, /dj/, /kj/, /fj/, /vj/, /θj/, /sj/,

/zj/, /hj/, /mj/, /nj/, /lj/, /spj/, /stj/, /skj/) must be followed by /uɪ/ or /ʊə/.

Long vowels and diphthongs are not followed by /ŋ/. /ʊ/ is rare in syllable-initial position. Stop + /w/ before /uɪ, ʊ, ʌ, aʊ/ are excluded.

Implementation1

1. Identify the first nucleus (single or multiple consecutive vowels) in the word.

2. Parse all the preceding consonants as the onset of the first syllable.

3. Find the next nucleus in the word. If unsuccessful in finding another nucleus, we'll simply parse the consonants to the right of the current nucleus as the coda of the first syllable, else we will move to the next step. Now the consonant cluster between these two nuclei have to be divided in two parts, one as coda of 1st syllable and the other onset of 2nd syllable.

4. If the no. of consonants in the cluster is one, it will be the onset of the second nucleus as per the Maximal Onset Principle and Constrains on Onset.

5. If it is two, check whether both of these can go to the onset of the second syllable as per the allowable onsets.

Implementation2

6. If not , the first consonant will be the coda of the first syllable and the second consonant will be the onset of the second syllable.

7. If the no. of consonants in the cluster is three, check if they can be the onset of the second syllable, if not, check for the last two, if not, parse only the last consonant as the onset of the second syllable.

8. If the no. of consonants in the cluster is more than three, except the last three consonants, parse all the consonants as the coda of the first syllable. With the remaining 3 consonants, apply step 7.

9. Continue the procedure for the rest of the string.

Accuracy Accuracy = (no. of words correctly syllabified x 100) %

no. of words

10,000 easy words were chosen to establish proof of concept 91 words were found to be incorrectly syllabified. Accuracy: 99.09%. High precision, low recall (a characteristic of knowledge based AI)

Statistical syllabification

Statistical Machine Syllabification

Data Sources of data

1. Election Commission of India (ECI) Name List : This web source provides native Indian names written in both English and Hindi.

2. Delhi University (DU) Student List : This web sources provides native Indian names written in English only. These names were manually transliterated for the purposes of training data.

3. Indian Institute of Technology, Bombay (IITB) Student List: The Academic Office of IITB provided this data of students who graduated in the year 2007.

4. Named Entities Workshop (NEWS) 2009 English-Hindi parallel names : A list of paired names between English and Hindi of size 11k is provided.

We start with 8,000 randomly chosen names from the ECI Name list and syllabify them manually.


Effect of Training Format Syllable-separated Format

Syllable-marked Format

Future Work

Source Targets u d a k a r su da karc h h a g a n chha ganj i t e s h ji teshn a r a y a n na ra yans h i v shivm a d h a v ma dhavm o h a m m a d mo ham madj a y a n t e e d e v i ja yan tee de vi

Source Targets u d a k a r s u _ d a _ k a rc h h a g a n c h h a _ g a nj i t e s h j i _ t e s hn a r a y a n n a _ r a _ y a ns h i v s h i vm a d h a v m a _ d h a vm o h a m m a d m o _ h a m _ m a dj a y a n t e e d e v i j a _ y a n _ t e e _ d e _ v i


Top-1 Accuracy: 71.8% (Syllable-separated); 80.5% (Syllable-marked)


60%65%70%75%80%85%90%95%

100%

1 2 3 4 5

Cum

ulati

ve A

ccur

acy

Accuracy Level

Syllable-separated Syllable-marked


Effect of Data Size 4 different experiments were performed:

1. 8k: Names from the ECI Name list.2. 12k: An additional 4k names were syllabified to increase the data size.3. 18k: IITB Student List and the DU Student List was included and syllabified.4. 23k: Some more names from ECI Name List and DU Student List were

syllabified and this data acts as the final data for us.

93.8%97.5% 98.3% 98.5% 98.6%

70.0%

75.0%

80.0%

85.0%

90.0%

95.0%

100.0%

1 2 3 4 5

Cum

ulati

ve A

ccur

acy

Accuracy Level

8k 12k 18k 23k


Effect of Language Model n-gram Order

85.0%

87.0%

89.0%

91.0%

93.0%

95.0%

97.0%

99.0%

1 2 3 4 5

Cum

ulati

ve A

ccur

acy

Accuracy Level

3-gram 4-gram 5-gram 6-gram 7-gram

Effect of Syllabification on Transliteration

Transliteration using syllabification

Moses

Moses

Syllabified data

(source–side)

Automatic Transliterator

Automatic Syllabifier

Syllabified data

(parallel)

Input String

Syllabified String

Transliterated String

Learnt syllabifier

Learnt transliterator

Statistical Machine Transliteration

Top-1 Accuracy: 60.1% (Syllable-separated); 50.2% (Syllable-marked) Top-5 Accuracy: 87.2% (Syllable-separated); 79.3% (Syllable-marked)

45%50%55%60%65%70%75%80%85%90%95%

100%

1 2 3 4 5 6

Cum

ulati

ve A

ccur

acy

Accuracy Level


The Top-5 Accuracy is just 63.0% which is much lower than required.

No syllabification: Baseline Performance

Top-n CorrectCorrect%age

Cumulative%age

1 1,868 41.5% 41.5%2 520 11.6% 53.1%3 246 5.5% 58.5%4 119 2.6% 61.2%5 81 1.8% 63.0%

Below 5 1,666 37.0% 100.0%4,500

40

Additional Additional

Manual, i.e., ideal syllabification: skyline Performance



60%65%70%75%80%85%90%95%

100%

1 2 3 4 5

Cum

ulati

ve A

ccur

acy

Accuracy Level


41

Obsevations

Transliteration likely to be helped by syllabification

Rule based syllabification: high precision low recall

Statistical syllabification: reasonable accuracy at top-5

References

1. Nasreen AbdulJaleel and Leah S. Larkey. Statistical transliteration for english-arabic cross language information retrieval. In Conference on Information and Knowledge Management, pages 139–146, 2003.

2. Ann K. Farmer Andrian Akmajian, Richard M. Demers and Robert M. Harnish. Linguistics: An Introduction to Language and Communication. MIT Press, 5th edition, 2001.

3. Association for Computer Linguistics. Collapsed Consonant and Vowel Models: New Approaches for English-Persian Transliteration and Back-Transliteration, 2007.

4. Slaven Bilac and Hozumi Tanaka. Direct combination of spelling and pronunciation information for robust back-transliteration. In Conferences on Computational Linguistics and Intelligent Text Processing, pages 413–424, 2005.

5. Ian Lane Bing Zhao, Nguyen Bach and Stephan Vogel. A log-linear block transliteration model based on bi-stream hmms. HLT/NAACL-2007, 2007.

References6. H. L. Jin and K. F. Wong. A Chinese dictionary construction algorithm for

information retrieval. In ACM Transactions on Asian Language Information Processing, pages 281–296, December 2002.

7. K. Knight and J. Graehl. Machine transliteration. In Computational Linguistics, pages 24(4):599–612, Dec. 1998.

8. Lee-Feng Chien Long Jiang, Ming Zhou and Chen Niu. Named entity translation with web mining and transliteration. In International Joint Conference on Artificial Intelligence (IJCAL-07), pages 1629–1634, 2007.

9. Dan Mateescu. English Phonetics and Phonological Theory. 2003.10. Della Pietra P. Brown and R. Mercer. The mathematics of statistical machine

translation: Parameter estimation. In Computational Linguistics, page 19(2):263U˝ 311, 1990.

11. Ganapathiraju Madhavi Prahallad Lavanya and Prahallad Kishore. A simple approach for building transliteration editors for Indian languages. Zhejiang University SCIENCE-2005, 2005.

cs460/626 : natural language processing/speech, nlp and the web (lecture 36–syllabification and...

Documents