cs460/626 : natural language processing/speech, nlp and the web (lecture 36–syllabification and...

44
CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 36–syllabification and transliteration) Pushpak Bhattacharyya CSE Dept., IIT Bombay 11 th April, 2011

Upload: gordon-bryan

Post on 23-Dec-2015

217 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 36–syllabification and transliteration) Pushpak Bhattacharyya CSE Dept., IIT Bombay

CS460/626 : Natural Language Processing/Speech, NLP and the Web

(Lecture 36–syllabification and transliteration)

Pushpak BhattacharyyaCSE Dept., IIT Bombay

11th April, 2011

Page 2: CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 36–syllabification and transliteration) Pushpak Bhattacharyya CSE Dept., IIT Bombay

Access from one language to another

Language L1

Translation

Language L2

TransliterationGandhiji गां��धी�जी�Boy

लड़का�

Page 3: CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 36–syllabification and transliteration) Pushpak Bhattacharyya CSE Dept., IIT Bombay

Cross Lingual Query

Query Translati

on

Search Engine

English Docume

nt

Translate Docume

nt

गां��धी� का� जीन्मस्था�न

Birthplace of Gandhi

Output

Page 4: CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 36–syllabification and transliteration) Pushpak Bhattacharyya CSE Dept., IIT Bombay

Sound based transliteration Transliteration

Script Script

Transliteration using pronunciation Convert script to sound Convert sound to script

Similar to interlingua based MT

MeaningSentence in source Language

Sentence in target Language

(गां��धी�जी�)(Gandhiji)

Page 5: CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 36–syllabification and transliteration) Pushpak Bhattacharyya CSE Dept., IIT Bombay

Two approaches to transliteration

Syllabification

Rule Based

Machine Learning Based

Word Syllabified

Apple ap ple

Fundamental fun da men tal

Use GIZA++ Moses on parallel corpus

Page 6: CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 36–syllabification and transliteration) Pushpak Bhattacharyya CSE Dept., IIT Bombay

Phoneme-based approach

Word inSource language

Pronunciationin

Source language

Word inTarget language

PronunciationIn

target language

P( ps | ws)

P ( pt | ps )

P ( wt | pt )

Note: Phoneme is the smallest linguistically distinctive unit of sound.

P(wt)

Wt* = argmax (P (wt). P (wt | pt) . P (pt | ps) . P (ps | ws) )

Page 7: CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 36–syllabification and transliteration) Pushpak Bhattacharyya CSE Dept., IIT Bombay

Basics of syllables

“Syllable is a unit of spoken language consisting of a single uninterrupted sound formed generally by a Vowel and preceded or followed by one or more consonants.”

Vowels are the heart of a syllable (Most Sonorous Element) (svayam raajate iti svaraH)

Consonants act as sounds attached to vowels.

Page 8: CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 36–syllabification and transliteration) Pushpak Bhattacharyya CSE Dept., IIT Bombay

Syllable structure

A syllable consists of 3 major parts:- Onset (C) Nucleus (V) Coda (C)

Vowels sit in the Nucleus of a syllable Consonants may get attached as

Onset or Coda. Basic structure - CV

Page 9: CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 36–syllabification and transliteration) Pushpak Bhattacharyya CSE Dept., IIT Bombay

Possible syllable structures The Nucleus is

always present Onset and Coda

may be absent Possible

structures V CV VC CVC

Page 10: CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 36–syllabification and transliteration) Pushpak Bhattacharyya CSE Dept., IIT Bombay

Introduction to sonority theory

“The Sonority of a sound is its loudness relative to other sounds with the same length, stress and speech.”

Some sounds are more sonorous Words in a language can be divided into

syllables Sonority theory distinguishes syllables

on the basis of sounds.

Page 11: CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 36–syllabification and transliteration) Pushpak Bhattacharyya CSE Dept., IIT Bombay

Sonority hierarchy Defined on the basis of amount of

sound associated The sonority hierarchy is as follows:-

Vowels (a, e, i, o, u) Liquids (y, r, l, v) Nasals (n, m) Fricatives (s, z, f,…..sh, th etc.) Affricates (ch, j) Stops (b, d, g, p, t, k)

Page 12: CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 36–syllabification and transliteration) Pushpak Bhattacharyya CSE Dept., IIT Bombay

Sonority scale Obstruents can

be further classified into:- Fricatives Affricates Stops

Page 13: CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 36–syllabification and transliteration) Pushpak Bhattacharyya CSE Dept., IIT Bombay

Sonority theory & syllables

“A Syllable is a cluster of sonority, defined by a sonority peak acting as a structural magnet to the surrounding lower sonority elements.”

Represented as waves of sonority or Sonority Profile of that syllable

Nucleus

Onset Coda

Page 14: CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 36–syllabification and transliteration) Pushpak Bhattacharyya CSE Dept., IIT Bombay

Sonority sequencing principle

“The Sonority Profile of a syllable must rise until its Peak(Nucleus), and then fall.”

Peak (Nucleus)

Onset Coda

Page 15: CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 36–syllabification and transliteration) Pushpak Bhattacharyya CSE Dept., IIT Bombay

Maximal onset principle

“The Intervocalic consonants are maximally assigned to the Onsets of syllables in conformity with Universal and Language-Specific Conditions.”

Determines underlying syllable division

Example DIPLOMA

DIP LO MA & DI PLO MA

Page 16: CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 36–syllabification and transliteration) Pushpak Bhattacharyya CSE Dept., IIT Bombay

Syllable Structure: a more detailed look

Count of no. of syllables in a word is roughly/intuitively the no. of vocalic segments in a word.

Thus, presence of a vowel is an obligatory element in the structure of a syllable. This vowel is called “nucleus”.

Basic Configuration: (C)V(C). Part of syllable preceding the nucleus is called the

onset. Elements coming after the nucleus are called the

coda. Nucleus and coda together are referred to as the

rhyme.

S ≡ Syllable, O ≡ OnsetR ≡ Rhyme, N ≡ NucleusCo ≡ Coda

Page 17: CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 36–syllabification and transliteration) Pushpak Bhattacharyya CSE Dept., IIT Bombay

Syllable Structure: Examples

‘word’

‘sprint’

Page 18: CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 36–syllabification and transliteration) Pushpak Bhattacharyya CSE Dept., IIT Bombay

Syllable Structure: Examples

‘may’

‘opt’

‘air’

No Coda.

No Onset.

No Coda, No Onset.

Page 19: CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 36–syllabification and transliteration) Pushpak Bhattacharyya CSE Dept., IIT Bombay

Syllable Structure Open Syllable: ends in vowel Closed syllable: ends in consonant or consonant

cluster

Light Syllable: A syllable which is open and ends in a short vowel

General Description – CV. Example, ‘air’.

Heavy Syllable: Closed syllables or syllables ending in diphthong

Example: ‘opt’ Example, ‘may’

Page 20: CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 36–syllabification and transliteration) Pushpak Bhattacharyya CSE Dept., IIT Bombay

Syllabification: Determining Syllable Boundaries

Given a string of syllables (word), what is the coda of one and the onset of another?

In a sequence such as VCV, where V is any vowel and C is any consonant, is the medial C the coda of the first syllable (VC.V) or the onset of the second syllable (V.CV)?

To determine the correct groupings, there are some rules, two of them being the most important and significant:

Maximal Onset Principle, Sonority Hierarchy

Page 21: CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 36–syllabification and transliteration) Pushpak Bhattacharyya CSE Dept., IIT Bombay

(This and statistical syllabification and application to Transliteration: DDP work

by Ankit Aggrwal, 2009)

Rule Based Syllabification

Page 22: CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 36–syllabification and transliteration) Pushpak Bhattacharyya CSE Dept., IIT Bombay

Phonotactics

Guides construction of onsets and codas

Restricts combination of phonemes.

In general, rules operate around the sonority hierarchy. E.g., Fricative /s/ is lower on the sonority hierarchy than the lateral /l/, so the

combination /sl/ is permitted in onsets and /ls/ is permitted in codas. Opposite is not allowed.

Thus, ‘slips’ and ‘pulse’ are possible English words. ‘lsips’ and ‘pusl’ are not possible.

Page 23: CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 36–syllabification and transliteration) Pushpak Bhattacharyya CSE Dept., IIT Bombay

Constraints on Onsets1

One-consonant: Only /ŋ/ cannot occur in syllable-initial position. Two-consonant: We refer to the scale of sonority.

Sequence ‘rn’ is ruled out since there is a decrease of sonority. Minimal Sonority Distance: Distance in sonority between the first and the

second element in the onset must be of at least 2 degrees. Thus, only a limited no. of possible two-consonant clusters.

Three-consonant: Restricted to licensed two-consonant onsets preceded by /s/. Also, /s/ can only be followed by a voiceless sound. Therefore, only /spl/, /spr/, /str/, /skr/, /spj/, /stj/, /skj/, /skw/,

/skl/, /smj/ will be allowed. (splinter, spray, strong etc.) While /sbl/, /sbr/, /sdr/, /sgr/, /sθr/ will be ruled out.

Page 24: CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 36–syllabification and transliteration) Pushpak Bhattacharyya CSE Dept., IIT Bombay

Constraints on Onsets2

Possible 2-consonat clusters in an Onset

Page 25: CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 36–syllabification and transliteration) Pushpak Bhattacharyya CSE Dept., IIT Bombay

Constraints on Coda1

Page 26: CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 36–syllabification and transliteration) Pushpak Bhattacharyya CSE Dept., IIT Bombay

Constraints on Coda2

Page 27: CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 36–syllabification and transliteration) Pushpak Bhattacharyya CSE Dept., IIT Bombay

Other Constraints

Nucleus: The following can occur as nucleus: All vowel sounds (monophthongs as well as diphthongs).

Syllabic: Both the onset and the coda are optional (as seen previously). /j/ at the end of an onset (/pj/, /bj/, /tj/, /dj/, /kj/, /fj/, /vj/, /θj/, /sj/,

/zj/, /hj/, /mj/, /nj/, /lj/, /spj/, /stj/, /skj/) must be followed by /uɪ/ or /ʊə/.

Long vowels and diphthongs are not followed by /ŋ/. /ʊ/ is rare in syllable-initial position. Stop + /w/ before /uɪ, ʊ, ʌ, aʊ/ are excluded.

Page 28: CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 36–syllabification and transliteration) Pushpak Bhattacharyya CSE Dept., IIT Bombay

Implementation1

1. Identify the first nucleus (single or multiple consecutive vowels) in the word.

2. Parse all the preceding consonants as the onset of the first syllable.

3. Find the next nucleus in the word. If unsuccessful in finding another nucleus, we'll simply parse the consonants to the right of the current nucleus as the coda of the first syllable, else we will move to the next step. Now the consonant cluster between these two nuclei have to be divided in two parts, one as coda of 1st syllable and the other onset of 2nd syllable.

4. If the no. of consonants in the cluster is one, it will be the onset of the second nucleus as per the Maximal Onset Principle and Constrains on Onset.

5. If it is two, check whether both of these can go to the onset of the second syllable as per the allowable onsets.

Page 29: CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 36–syllabification and transliteration) Pushpak Bhattacharyya CSE Dept., IIT Bombay

Implementation2

6. If not , the first consonant will be the coda of the first syllable and the second consonant will be the onset of the second syllable.

7. If the no. of consonants in the cluster is three, check if they can be the onset of the second syllable, if not, check for the last two, if not, parse only the last consonant as the onset of the second syllable.

8. If the no. of consonants in the cluster is more than three, except the last three consonants, parse all the consonants as the coda of the first syllable. With the remaining 3 consonants, apply step 7.

9. Continue the procedure for the rest of the string.

Page 30: CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 36–syllabification and transliteration) Pushpak Bhattacharyya CSE Dept., IIT Bombay

Accuracy Accuracy = (no. of words correctly syllabified x 100) %

no. of words

10,000 easy words were chosen to establish proof of concept 91 words were found to be incorrectly syllabified. Accuracy: 99.09%. High precision, low recall (a characteristic of knowledge based AI)

Page 31: CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 36–syllabification and transliteration) Pushpak Bhattacharyya CSE Dept., IIT Bombay

Statistical syllabification

Page 32: CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 36–syllabification and transliteration) Pushpak Bhattacharyya CSE Dept., IIT Bombay

Statistical Machine Syllabification

Data Sources of data

1. Election Commission of India (ECI) Name List : This web source provides native Indian names written in both English and Hindi.

2. Delhi University (DU) Student List : This web sources provides native Indian names written in English only. These names were manually transliterated for the purposes of training data.

3. Indian Institute of Technology, Bombay (IITB) Student List: The Academic Office of IITB provided this data of students who graduated in the year 2007.

4. Named Entities Workshop (NEWS) 2009 English-Hindi parallel names : A list of paired names between English and Hindi of size 11k is provided.

We start with 8,000 randomly chosen names from the ECI Name list and syllabify them manually.

Page 33: CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 36–syllabification and transliteration) Pushpak Bhattacharyya CSE Dept., IIT Bombay

Statistical Machine Syllabification

Effect of Training Format Syllable-separated Format

Syllable-marked Format

Future Work

Source Targets u d a k a r su da karc h h a g a n chha ganj i t e s h ji teshn a r a y a n na ra yans h i v shivm a d h a v ma dhavm o h a m m a d mo ham madj a y a n t e e d e v i ja yan tee de vi

Source Targets u d a k a r s u _ d a _ k a rc h h a g a n c h h a _ g a nj i t e s h j i _ t e s hn a r a y a n n a _ r a _ y a ns h i v s h i vm a d h a v m a _ d h a vm o h a m m a d m o _ h a m _ m a dj a y a n t e e d e v i j a _ y a n _ t e e _ d e _ v i

Page 34: CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 36–syllabification and transliteration) Pushpak Bhattacharyya CSE Dept., IIT Bombay

Statistical Machine Syllabification

Top-1 Accuracy: 71.8% (Syllable-separated); 80.5% (Syllable-marked)

Top-5 Accuracy: 83.4% (Syllable-separated); 90.4% (Syllable-marked)

60%65%70%75%80%85%90%95%

100%

1 2 3 4 5

Cum

ulati

ve A

ccur

acy

Accuracy Level

Syllable-separated Syllable-marked

Page 35: CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 36–syllabification and transliteration) Pushpak Bhattacharyya CSE Dept., IIT Bombay

Statistical Machine Syllabification

Effect of Data Size 4 different experiments were performed:

1. 8k: Names from the ECI Name list.2. 12k: An additional 4k names were syllabified to increase the data size.3. 18k: IITB Student List and the DU Student List was included and syllabified.4. 23k: Some more names from ECI Name List and DU Student List were

syllabified and this data acts as the final data for us.

93.8%97.5% 98.3% 98.5% 98.6%

70.0%

75.0%

80.0%

85.0%

90.0%

95.0%

100.0%

1 2 3 4 5

Cum

ulati

ve A

ccur

acy

Accuracy Level

8k 12k 18k 23k

Page 36: CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 36–syllabification and transliteration) Pushpak Bhattacharyya CSE Dept., IIT Bombay

Statistical Machine Syllabification

Effect of Language Model n-gram Order

85.0%

87.0%

89.0%

91.0%

93.0%

95.0%

97.0%

99.0%

1 2 3 4 5

Cum

ulati

ve A

ccur

acy

Accuracy Level

3-gram 4-gram 5-gram 6-gram 7-gram

Page 37: CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 36–syllabification and transliteration) Pushpak Bhattacharyya CSE Dept., IIT Bombay

Effect of Syllabification on Transliteration

Page 38: CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 36–syllabification and transliteration) Pushpak Bhattacharyya CSE Dept., IIT Bombay

Transliteration using syllabification

Moses

Moses

Syllabified data

(source–side)

Automatic Transliterator

Automatic Syllabifier

Syllabified data

(parallel)

Input String

Syllabified String

Transliterated String

Learnt syllabifier

Learnt transliterator

Page 39: CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 36–syllabification and transliteration) Pushpak Bhattacharyya CSE Dept., IIT Bombay

Statistical Machine Transliteration

Top-1 Accuracy: 60.1% (Syllable-separated); 50.2% (Syllable-marked) Top-5 Accuracy: 87.2% (Syllable-separated); 79.3% (Syllable-marked)

45%50%55%60%65%70%75%80%85%90%95%

100%

1 2 3 4 5 6

Cum

ulati

ve A

ccur

acy

Accuracy Level

Syllable-separated Syllable-marked

Page 40: CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 36–syllabification and transliteration) Pushpak Bhattacharyya CSE Dept., IIT Bombay

The Top-5 Accuracy is just 63.0% which is much lower than required.

No syllabification: Baseline Performance

Top-n CorrectCorrect%age

Cumulative%age

1 1,868 41.5% 41.5%2 520 11.6% 53.1%3 246 5.5% 58.5%4 119 2.6% 61.2%5 81 1.8% 63.0%

Below 5 1,666 37.0% 100.0%4,500

40

Additional Additional

Page 41: CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 36–syllabification and transliteration) Pushpak Bhattacharyya CSE Dept., IIT Bombay

Manual, i.e., ideal syllabification: skyline Performance

Top-1 Accuracy: 71.8% (Syllable-separated); 80.5% (Syllable-marked)

Top-5 Accuracy: 87.4% (Syllable-separated); 90.4% (Syllable-marked)

60%65%70%75%80%85%90%95%

100%

1 2 3 4 5

Cum

ulati

ve A

ccur

acy

Accuracy Level

Syllable-separated Syllable-marked

41

Page 42: CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 36–syllabification and transliteration) Pushpak Bhattacharyya CSE Dept., IIT Bombay

Obsevations

Transliteration likely to be helped by syllabification

Rule based syllabification: high precision low recall

Statistical syllabification: reasonable accuracy at top-5

Page 43: CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 36–syllabification and transliteration) Pushpak Bhattacharyya CSE Dept., IIT Bombay

References

1. Nasreen AbdulJaleel and Leah S. Larkey. Statistical transliteration for english-arabic cross language information retrieval. In Conference on Information and Knowledge Management, pages 139–146, 2003.

2. Ann K. Farmer Andrian Akmajian, Richard M. Demers and Robert M. Harnish. Linguistics: An Introduction to Language and Communication. MIT Press, 5th edition, 2001.

3. Association for Computer Linguistics. Collapsed Consonant and Vowel Models: New Approaches for English-Persian Transliteration and Back-Transliteration, 2007.

4. Slaven Bilac and Hozumi Tanaka. Direct combination of spelling and pronunciation information for robust back-transliteration. In Conferences on Computational Linguistics and Intelligent Text Processing, pages 413–424, 2005.

5. Ian Lane Bing Zhao, Nguyen Bach and Stephan Vogel. A log-linear block transliteration model based on bi-stream hmms. HLT/NAACL-2007, 2007.

Page 44: CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 36–syllabification and transliteration) Pushpak Bhattacharyya CSE Dept., IIT Bombay

References6. H. L. Jin and K. F. Wong. A Chinese dictionary construction algorithm for

information retrieval. In ACM Transactions on Asian Language Information Processing, pages 281–296, December 2002.

7. K. Knight and J. Graehl. Machine transliteration. In Computational Linguistics, pages 24(4):599–612, Dec. 1998.

8. Lee-Feng Chien Long Jiang, Ming Zhou and Chen Niu. Named entity translation with web mining and transliteration. In International Joint Conference on Artificial Intelligence (IJCAL-07), pages 1629–1634, 2007.

9. Dan Mateescu. English Phonetics and Phonological Theory. 2003.10. Della Pietra P. Brown and R. Mercer. The mathematics of statistical machine

translation: Parameter estimation. In Computational Linguistics, page 19(2):263U˝ 311, 1990.

11. Ganapathiraju Madhavi Prahallad Lavanya and Prahallad Kishore. A simple approach for building transliteration editors for Indian languages. Zhejiang University SCIENCE-2005, 2005.