ARABICA is a computer program for the unsupervised learning of root-and-pattern morphology, a kind of structure typically found in languages such as Arabic or Hebrew. The program attempts to analyze each word of a corpus into a consonantal root and a vocalic scheme. To do this, it first applies the algorithm described by Sukhotin (1962, 1973) to identify consonants and vowels based only on their distribution. Based on this categorization of symbols, the program uses principles of analogy and parsimony to decompose each word into a root and a scheme, and to identify the rules that govern the combination of these morphs. These rules are expressed by means of a particular kind of finite-state automaton, called an RP structure. The learning algorithm used by ARABICA relies on the principle of minimum description length, or MDL (Rissanen, 1989), following the approach pioneered by John Goldsmith's LINGUISTICA program (2001, 2006).
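The first step described above, separating vowels from consonants on distributional grounds alone, can be illustrated with a short sketch. The Python code below follows Sukhotin's algorithm as it is commonly presented in the literature; it is not taken from the ARABICA Perl script, and the function name `sukhotin`, the tie-breaking rule, and the romanized sample words are assumptions made for illustration only.

```python
from collections import defaultdict


def sukhotin(words):
    """Return the set of symbols classified as vowels (illustrative sketch)."""
    # Symmetric counts of how often two distinct symbols occur side by side.
    adj = defaultdict(lambda: defaultdict(int))
    symbols = set()
    for word in words:
        symbols.update(word)
        for a, b in zip(word, word[1:]):
            if a != b:  # identical neighbours are ignored (zero diagonal)
                adj[a][b] += 1
                adj[b][a] += 1

    # Every symbol starts out as a consonant; its score is its row sum.
    sums = {s: sum(adj[s].values()) for s in symbols}
    consonants = set(symbols)
    vowels = set()

    while consonants:
        # Assumption: ties are broken alphabetically to keep runs deterministic.
        best = max(sorted(consonants), key=lambda s: sums[s])
        if sums[best] <= 0:
            break  # no remaining symbol looks like a vowel
        vowels.add(best)
        consonants.remove(best)
        # Symbols adjacent to the new vowel become less vowel-like.
        for c in consonants:
            sums[c] -= 2 * adj[c][best]
    return vowels


# Hypothetical romanized word forms, used only as toy input.
print(sorted(sukhotin(["kataba", "kitab", "katib", "kutub"])))  # ['a', 'i', 'u']
```

The sketch relies on the intuition that vowels and consonants tend to alternate: the symbol with the most varied neighbourhood is taken to be a vowel, and the scores of its neighbours are then lowered before the next symbol is selected.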
ARABICA is implemented as a Perl script named Arabica1.0.pl. This program is free software, distributed under the GNU General Public License (GPL). It can be freely downloaded from this page, along with a small list of Arabic nouns that can be used for testing purposes. For more details about the program and the wordlist, see the file Instructions.txt and Xanthos (2007).
Arabica1.0.zip (28 Kb)
Goldsmith, John A. (2001). Unsupervised learning of the morphology of a natural language. Computational Linguistics, 27(2), 153-198.
Goldsmith, John A. (2006). An algorithm for the unsupervised learning of morphology. Natural Language Engineering, 12(4), 353-371.
Rissanen, Jorma. (1989). Stochastic Complexity in Statistical Inquiry. Singapore: World Scientific Publishing Co.
Sukhotin, Boris V. (1962). Eksperimental'noe vydelenie klassov bukv s pomoščju EVM. Problemy strukturnoj lingvistiki, 234, 189-206.
Sukhotin, Boris V. (1973). Méthode de déchiffrage, outil de recherche en linguistique. T.A. Informations, 2, 1-43.
Xanthos, Aris. (2007). Apprentissage automatique de la morphologie: le cas des structures racine-schème. Unpublished PhD thesis, University of Lausanne.