التشكيل الآلي للنص العربي (Automatic Diacritization of Arabic Text)
Automatic Generation of Arabic Diacritical Marks
The absence of vowelization marks from modern Arabic text is a major obstacle to machine translation and other text-understanding applications. In this paper we formulate the problem of automatically generating Arabic diacritical marks from unvoweled text using a Hidden Markov Model (HMM) approach. The model treats the word sequence of the unvoweled Arabic text as an observation sequence, and the possible diacritized forms of the words as hidden states. The optimal sequence of diacritized words (or states) is then obtained efficiently using a dynamic programming algorithm. We present the basic algorithm and its evaluation, and discuss its limitations as well as possible extensions for improving its performance.
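The decoding step described above can be illustrated with a small Viterbi search. This is only a minimal sketch under stated assumptions, not the Tashkeel implementation: it assumes a bigram model over diacritized words, treats every candidate form as emitting its unvoweled word with probability 1, and uses a fixed floor probability for unseen pairs. The function name `viterbi_diacritize`, the transliterated toy words, and all probabilities are hypothetical and chosen only for illustration.

```python
import math


def viterbi_diacritize(words, candidates, bigram, floor=1e-6):
    """Return the most probable diacritized sequence for `words`.

    words:      list of unvoweled words (the observation sequence).
    candidates: dict mapping each unvoweled word to its possible
                diacritized forms (the hidden states).
    bigram:     dict mapping (previous_form, form) to an assumed
                P(form | previous_form); "<s>" marks the sentence start.
    floor:      assumed probability for pairs missing from `bigram`.
    """
    if not words:
        return []

    def logp(prev, cur):
        return math.log(bigram.get((prev, cur), floor))

    # best[i][state] = (log-probability of the best path ending in
    # `state` at position i, previous state on that path)
    best = [{s: (logp("<s>", s), None) for s in candidates[words[0]]}]
    for word in words[1:]:
        column = {}
        for s in candidates[word]:
            prev, score = max(
                ((p, best[-1][p][0] + logp(p, s)) for p in best[-1]),
                key=lambda item: item[1],
            )
            column[s] = (score, prev)
        best.append(column)

    # Start from the best final state and follow the back-pointers.
    state = max(best[-1], key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    path.reverse()
    return path


# Hypothetical toy data in a rough Latin transliteration.
TOY_CANDIDATES = {
    "ktb": ["kataba", "kutubN"],
    "Alwld": ["Alwaladu", "Alwalada"],
    "Aldrs": ["Aldarsa", "Aldarsu"],
}
TOY_BIGRAM = {
    ("<s>", "kataba"): 0.6, ("<s>", "kutubN"): 0.4,
    ("kataba", "Alwaladu"): 0.7, ("kataba", "Alwalada"): 0.1,
    ("kutubN", "Alwaladu"): 0.1, ("kutubN", "Alwalada"): 0.2,
    ("Alwaladu", "Aldarsa"): 0.8, ("Alwaladu", "Aldarsu"): 0.1,
    ("Alwalada", "Aldarsa"): 0.1, ("Alwalada", "Aldarsu"): 0.3,
}

if __name__ == "__main__":
    print(viterbi_diacritize(["ktb", "Alwld", "Aldrs"],
                             TOY_CANDIDATES, TOY_BIGRAM))
```

Working in log-probabilities avoids numerical underflow on long sentences. The cost grows with the sentence length times the square of the number of candidate forms per word, which is what makes the dynamic programming search efficient compared with enumerating every combination.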
1- Using Quran text for training and testing
Publication:
Moustafa Elshafei, Husni Al-Muhtaseb, and Mansour Alghamdi, “Statistical Methods for Automatic Diacritization of Arabic Text”, Proceedings of the 18th National Computer Conference, Riyadh, March 26-29, 2006 (PPT presentation)
2- Using a database of Arabic Text (original text was provided by KACST)
Publication:
Moustafa Elshafei, Husni Al-Muhtaseb, and Mansour Alghamdi, “Machine Generation of Arabic Diacritical Marks”, Proceedings of the 2006 International Conference on Machine Learning; Models, Technologies, and Applications (MLMTA'06), June 2006, USA
Program source code (Tashkeel_v1); Brief Documentation; Source code for generating bigram