Automatic Diacritization of Arabic Text

Automatic Generation of Arabic Diacritical Marks

 

The absence of vowelization marks from modern Arabic text is a major obstacle in machine translation and other text understanding applications. In this paper we formulate the problem of automatically generating Arabic diacritical marks from unvoweled text using a Hidden Markov Model (HMM) approach. The model treats the word sequence of unvoweled Arabic text as an observation sequence, and the possible diacritized forms of the words as hidden states. The optimal sequence of diacritized words (or states) is then obtained efficiently using a dynamic programming algorithm. We present the basic algorithm and its evaluation, and discuss its limitations as well as various ways to improve its performance.
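The formulation above can be illustrated with a minimal sketch. This is not the Tashkeel program: it assumes a bigram transition model with add-alpha smoothing, a Viterbi-style search as the dynamic programming step, and a toy Latin transliteration in which the letters a, i, u stand in for diacritical marks. The names `strip_marks`, `train`, and `diacritize`, and the tiny corpus, are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict

MARKS = set("aiu")  # assumption: toy stand-ins for Arabic diacritical marks
START = "<s>"       # sentence-start pseudo-state


def strip_marks(word):
    """Map a diacritized word (hidden state) to its unvoweled form (observation)."""
    return "".join(ch for ch in word if ch not in MARKS)


def train(sentences):
    """Collect candidate forms and bigram counts from diacritized sentences."""
    candidates = defaultdict(set)                   # unvoweled -> diacritized forms
    bigrams = defaultdict(lambda: defaultdict(int)) # previous form -> next form -> count
    vocab = set()
    for sentence in sentences:
        prev = START
        for word in sentence:
            candidates[strip_marks(word)].add(word)
            bigrams[prev][word] += 1
            vocab.add(word)
            prev = word
    return candidates, bigrams, len(vocab)


def diacritize(words, candidates, bigrams, vocab_size, alpha=0.1):
    """Return the most probable diacritized sequence for unvoweled words."""

    def log_p(prev, cur):
        # Add-alpha smoothed bigram probability P(cur | prev).
        row = bigrams.get(prev, {})
        return math.log((row.get(cur, 0) + alpha)
                        / (sum(row.values()) + alpha * vocab_size))

    # best[state] = (log score of the best path ending in state, that path)
    best = {START: (0.0, [])}
    for word in words:
        # Each state emits exactly one unvoweled form, so only forms seen
        # for this word are considered; an unseen word is passed through.
        states = candidates.get(word) or {word}
        new_best = {}
        for state in states:
            score, path = max(
                ((s + log_p(p, state), pth) for p, (s, pth) in best.items()),
                key=lambda t: t[0],
            )
            new_best[state] = (score, path + [state])
        best = new_best
    return max(best.values(), key=lambda t: t[0])[1]


# Toy diacritized training corpus (hypothetical tokens).
corpus = [
    ["walad", "kataba", "dars"],
    ["kutub", "jadida"],
    ["walad", "kataba", "kutub"],
]
candidates, bigrams, vocab_size = train(corpus)
print(diacritize(["wld", "ktb", "drs"], candidates, bigrams, vocab_size))
print(diacritize(["ktb", "jdd"], candidates, bigrams, vocab_size))
```

In this toy corpus the unvoweled form "ktb" has two candidate states, "kataba" and "kutub", and the neighbouring words decide which one the search selects. The cost of the search grows linearly with sentence length and quadratically with the number of candidate forms per word.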

 

1- Using Quran text for training and testing

     Quran_text Resources

Publication:

Moustafa Elshafei, Husni Al-Muhtaseb, and Mansour Alghamdi, “Statistical Methods for Automatic Diacritization of Arabic Text”, Proceedings of the 18th National Computer Conference, Riyadh, March 26-29, 2006. (ppt presentation)

2- Using a database of Arabic Text (original text was provided by KACST)

Arabic_text Resources

Publication:

Moustafa Elshafei, Husni Al-Muhtaseb, and Mansour Alghamdi, “Machine Generation of Arabic Diacritical Marks”, Proceedings of the 2006 International Conference on Machine Learning; Models, Technologies, and Applications (MLMTA'06), June 2006, USA.

Program source code (Tashkeel_v1, Brief Documentation, Source code for generating bigram