
Saturday, July 22, 2017

Sanskrit Sandhi Splitter using a Statistical Approach


Back here again after a long time.

I have started a new experiment: developing a Sanskrit sandhi splitter using a statistical approach.

Aim of the Project
- To be able to split and/or analyze the sandhis in Anugita / अनुगीता
- Expected Accuracy: >= 80%
- False Positives: <= 5%

Anugita text is available here: http://sanskritdocuments.org/doc_giitaa/anugiitaa.itx

Approach
1. We will use the Bhagavad Gita as the source text for the rules, because its Sandhi Vigraha and Anvaya text is readily available in many places.
See: http://sanskritdocuments.org/doc_giitaa/gitAanvayasandhivigraha.pdf

2. Analyze each chapter one by one.

3. In each chapter, analyze and list every possible sandhi viccheda, and sort by frequency.

For example, in chapter 1 the five most common sandhi rules are (split form -> joined form, with <blank> marking the word boundary):
1. m<blank> -> M<blank>
2. m<blank>a -> ma
3. H<blank>cha -> shcha
4. a<blank>a -> A
5. aH<blank>a -> o.a
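A frequency list like the one above could be tallied roughly as follows. This is only a sketch, not the tool used for the numbers in this post: it assumes the split and joined forms have already been aligned one junction at a time, and the names `extract_rule`, `rule_frequencies`, and the context widths `left`/`right` are hypothetical. The rule is taken to be whatever remains after trimming the shared prefix and suffix, plus a little context, so it will not always carry the same context as the hand-written rules above.

```python
from collections import Counter

def extract_rule(split_form, joined_form, left=0, right=1):
    """Return (split_side, joined_side) for one aligned junction.

    The shared prefix and suffix are trimmed away; `left` and `right`
    are assumed context widths (in characters) kept around the change.
    """
    shorter = min(len(split_form), len(joined_form))
    i = 0  # length of the shared prefix
    while i < shorter and split_form[i] == joined_form[i]:
        i += 1
    j = 0  # length of the shared suffix, never overlapping the prefix
    while j < shorter - i and split_form[-1 - j] == joined_form[-1 - j]:
        j += 1
    i = max(0, i - left)
    j = max(0, j - right)
    return (split_form[i:len(split_form) - j],
            joined_form[i:len(joined_form) - j])

def rule_frequencies(pairs):
    """Count the extracted rules and sort them, most frequent first."""
    return Counter(extract_rule(s, j) for s, j in pairs).most_common()

# Junction pairs built from the examples used in this post.
pairs = [
    ("tasyAm sabhAyAm", "tasyAM sabhAyAm"),
    ("sabhAyAm ramyAyAm", "sabhAyAM ramyAyAm"),
    ("rAmaH ashvam", "rAmo.ashvam"),
]
for (split_side, joined_side), count in rule_frequencies(pairs):
    print(repr(split_side), "->", repr(joined_side), count)
```

Here a plain space stands in for <blank>. The hard part, aligning the sandhi vigraha text with the original verse, is not covered by this sketch.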

4. Once this list is ready, apply it to the Anugita text.

For example, applying Rule 1 in reverse (joined form back to split form):
tasyAM sabhAyAM ramyAyAM -> tasyAm sabhAyAm ramyAyAm
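Undoing Rule 1 can be sketched with a single regular expression. This is a minimal illustration, assuming ITRANS-style text in which `M` is the anusvara; the function name `undo_rule_1` is made up for this example. Only a word-final `M` (followed by whitespace or the end of the text) is touched, so an anusvara inside a word is left alone.

```python
import re

def undo_rule_1(text):
    """Turn a word-final anusvara (M) back into m."""
    # The lookahead keeps the following blank in place and also
    # matches an M at the very end of the text.
    return re.sub(r"M(?=\s|$)", "m", text)

print(undo_rule_1("tasyAM sabhAyAM ramyAyAM"))
```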

5. While applying the rules, cross-check each word against the database. If the word is already in the database, it needs no sandhi viccheda. Otherwise, the word becomes a candidate for analysis.

6. There may be false positives, which must be addressed by continuously updating the database.

In the extreme case where the database is empty, nothing stops the rules from being applied, and we may get erroneous results (false positives) such as the following.

rAmo.ashvamapashyat / रामोऽश्वमपश्यत्
-> rAmaH ashvamapashyat / रामः अश्वमपश्यत् (using Rule 5)
-> rAm aH ashvam apashyat / राम् अः अश्वम् अपश्यत् (using Rule 2)
-> ra am aH ashvam apashyat / र अम् अः अश्वम् अपश्यत् (using Rule 4)
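The effect of the database on over-splitting can be reproduced with a small sketch. It is an illustration under assumptions, not the actual splitter: the function `apply_rules`, the `known_words` set, and the choice to apply each rule once per word at its first match are all hypothetical. The three rules are the visarga, m-joining, and vowel-merging rules from the chapter 1 list (rules 5, 2, and 4), written as (joined, split) pairs with a space for <blank>.

```python
def apply_rules(text, rules, known_words=frozenset()):
    """Undo each rule in turn on every word not found in known_words."""
    tokens = text.split()
    for joined, split in rules:
        next_tokens = []
        for tok in tokens:
            if tok in known_words or joined not in tok:
                # Known words are left alone; so are words the rule misses.
                next_tokens.append(tok)
            else:
                # Replace only the first match, then re-split on the blank.
                next_tokens.extend(tok.replace(joined, split, 1).split())
        tokens = next_tokens
    return " ".join(tokens)

RULES = [("o.a", "aH a"), ("ma", "m a"), ("A", "a a")]

# Empty database: every rule fires and the sentence is over-split.
print(apply_rules("rAmo.ashvamapashyat", RULES))

# With the three real words known, splitting stops at the right place.
known = {"rAmaH", "ashvam", "apashyat"}
print(apply_rules("rAmo.ashvamapashyat", RULES, known))
```

With the empty database this prints the over-split form shown above; with the three words in the database it stops at rAmaH ashvam apashyat.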

7. Update the sandhi frequency list chapter by chapter.

8. After the Bhagavad Gita is done, start on the Hitopadesha.

I will post the sandhi results after each run in the following 18 posts, one for each of the 18 chapters of the Gita, as the frequency list grows.

Only time will tell whether a statistical approach to splitting sandhi will bear fruit.

Thanks! Keep coming back for updates.
