KALIMAT Multipurpose Arabic Corpus Wiki

A corpus that could be of help for researchers working on Arabic NLP

Brought to you by: melhaj

KALIMAT a Multipurpose Arabic Corpus

We are pleased to announce the immediate availability of KALIMAT 1.0.

KALIMAT is an Arabic natural language resource that consists of:
1) 20,291 Arabic articles collected from the Omani newspaper Alwatan by Abbas et al. (2011).
2) 20,291 Extractive Single-document system summaries.
3) 2,057 Extractive Multi-document system summaries.
4) 20,291 Named Entity Recognised articles.
5) 20,291 Part of Speech Tagged articles.
6) 20,291 Morphologically Analysed articles.

The data collection articles fall into six categories:
culture, economy, local-news, international-news, religion, and sports.

The process of creating KALIMAT was applied to the entire data collection (20,291 articles).
Firstly, we summarised the document collection using two Arabic summarisers, Gen-Summ and
Cluster-based. Gen-Summ (El-Haj et al. 2010) is a single-document summariser based on the vector space
model (VSM) (Salton et al. 1975) that takes an Arabic document and its first sentence and returns an
extractive summary. It produced 20,291 system summaries. Cluster-based (El-Haj et al. 2011)
is a multi-document summariser that treats all documents to be summarised as a single bag of sentences.
The sentences of all the documents are clustered using different numbers of clusters.
A summary is created by selecting sentences from the biggest cluster only (if two clusters tie for biggest,
we select the first one). We generated 2,057 multi-document extractive system summaries: one summary for
each group of 10, 100 and 500 articles in each category, plus one summary for all the articles in each category.
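
The two selection ideas described above can be illustrated with a minimal sketch. It is not the Gen-Summ or Cluster-based code: the function names, the whitespace tokenisation, the raw term-frequency weighting and the `size` parameter are all assumptions made for illustration, and the clustering step itself is left out (cluster labels are taken as given).

```python
import math
from collections import Counter


def tf_vector(sentence):
    """Term-frequency vector of a sentence (naive whitespace tokenisation)."""
    return Counter(sentence.split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def summarise_by_first_sentence(sentences, size=2):
    """Keep the `size` sentences most similar to the first sentence.

    The first sentence acts as the query in the vector space model.
    The selected sentences are returned in document order.
    """
    query = tf_vector(sentences[0])
    scored = [(cosine(query, tf_vector(s)), i) for i, s in enumerate(sentences)]
    top = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:size]
    return [sentences[i] for _, i in sorted(top, key=lambda pair: pair[1])]


def biggest_cluster(labels):
    """Label of the largest cluster; on a tie, the cluster seen first wins."""
    counts = Counter(labels)  # keeps first-seen order of labels
    return max(counts, key=lambda label: counts[label])
```

Given one cluster label per sentence, the multi-document summary would then be the sentences whose label equals `biggest_cluster(labels)`.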

Secondly, we used an Arabic Named Entity Recognition system (ANER) (Koulali and Meziane 2012)
to annotate the data collection.
The annotation follows the scheme of the Computational Natural Language Learning (CoNLL) 2002
and 2003 shared tasks, in which each tag falls into one of the following four categories:
•Person Names: محمود درويش (Mahmoud Darwish).
•Location Names: المغرب (Morocco).
•Organisation Names: الأمم المتحدة (United Nations).
•Miscellaneous Names: NEs that do not belong to any of the previous classes, including dates, times, numbers,
monetary expressions, measurement expressions and percentages.

The ANER system was trained using ANERCorpus (Benajiba et al. 2007), a manually annotated corpus that follows
the CoNLL shared task. We chose ANERCorpus to train our system because its articles were drawn from Arabic
newswires and Arabic Wikipedia, which are quite close to Alwatan’s data collection.
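
As an illustration of working with CoNLL-style annotation, the sketch below groups tagged tokens into entities. It assumes one "token TAG" pair per line and B-/I-/O tag prefixes with type names such as PERS and LOC. This layout is an assumption and may differ from the files shipped with KALIMAT, and `extract_entities` is a hypothetical helper name.

```python
def extract_entities(lines):
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs.

    Each input line is assumed to hold a token and its tag separated by
    whitespace; any other line (for example a blank one) ends the entity.
    """
    entities = []
    current_tokens, current_type = [], None
    for line in list(lines) + [""]:  # the empty sentinel flushes the last entity
        parts = line.split()
        token, tag = parts if len(parts) == 2 else (None, "O")
        continues = tag.startswith("I-") and tag[2:] == current_type
        if current_tokens and not continues:
            entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
        if tag.startswith("B-") or (tag.startswith("I-") and not current_tokens):
            # Start a new entity; a stray I- tag is treated leniently as a start.
            current_tokens, current_type = [token], tag[2:]
        elif continues:
            current_tokens.append(token)
    return entities
```

For example, the lines "محمود B-PERS", "درويش I-PERS", "زار O", "المغرب B-LOC" would yield one PERS entity and one LOC entity.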

Thirdly, we used the Stanford POS Tagger (Toutanova et al. 2003) to annotate the 20,291-document collection.
The Arabic model is a maximum entropy model trained on the Arabic Treebank (ATB) parts 1-3 corpus,
using an augmented Bies mapping of ATB tags. The tagger identifies 33 parts of speech, using the Penn
Treebank project codification, such as Noun (NN), Plural Noun (NNS), Proper Noun (NNP), Verb (VB) and Adjective (JJ).
The tagger reached an accuracy of 96.50%.
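
A tagged article can be inspected with a few lines of code. The sketch below counts how often each tag occurs, assuming that every token is written as word/TAG and that tokens are separated by whitespace. The separator and the layout are assumptions to check against the actual files, and `tag_counts` is a hypothetical helper name.

```python
from collections import Counter


def tag_counts(tagged_text, sep="/"):
    """Count POS tags in text whose tokens look like word<sep>TAG."""
    counts = Counter()
    for item in tagged_text.split():
        # Split on the last separator so that words containing it still parse.
        word, found, tag = item.rpartition(sep)
        if found and word and tag:
            counts[tag] += 1
    return counts
```

Calling `tag_counts(text).most_common(5)` would list the five most frequent tags in an article.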

Finally, we applied morphological analysis to the data collection using the Alkhalil morphological
analyser (Mazroui et al. 2011). The analysis was carried out in the following steps: pre-processing (removal of diacritics)
and segmentation (each word is treated as [proclitic + stem + enclitic]).
Applying the Alkhalil analyser to the data collection, we reached an accuracy of 96%.
We implemented a Viterbi algorithm to select the single solution that is relevant to the context of the analysed article.
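
The pre-processing and disambiguation steps can be sketched generically as follows. This is not the Alkhalil code or the implementation used for KALIMAT: the diacritic range (U+064B to U+0652), the function names and the two scoring callbacks are assumptions, and the scores stand in for whatever log-probabilities a real model would supply.

```python
import re

# Arabic diacritic marks (tanween, short vowels, shadda, sukun).
DIACRITICS = re.compile("[\u064B-\u0652]")


def strip_diacritics(text):
    """Remove diacritic marks, leaving the bare letters."""
    return DIACRITICS.sub("", text)


def viterbi(candidates, emission, transition):
    """Choose one analysis per word so that the total score is maximal.

    candidates: one list of possible analyses per word, in text order
    emission(i, label): score of `label` for the word at position i
    transition(prev, label): score of `label` following `prev`
    Scores are added, so they are expected to be log-probabilities.
    """
    # best[label] = (score of the best path ending in label, that path)
    best = {label: (emission(0, label), [label]) for label in candidates[0]}
    for i in range(1, len(candidates)):
        step = {}
        for label in candidates[i]:
            prev = max(best, key=lambda p: best[p][0] + transition(p, label))
            score = best[prev][0] + transition(prev, label) + emission(i, label)
            step[label] = (score, best[prev][1] + [label])
        best = step
    return max(best.values(), key=lambda entry: entry[0])[1]
```

Because each position keeps only the best path per candidate analysis, the cost grows linearly with the number of words rather than with the number of possible analysis sequences.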

We provide KALIMAT for free, including the articles, annotated text, entities and summaries, to help
advance the work on Arabic NLP.

The corpus can be downloaded directly from:
http://bit.ly/16jO3Ks (https://sourceforge.net/projects/kalimat/)

The corpus and the results we achieved can be used by researchers as
gold standards and/or baselines to test and evaluate their Arabic tools.
We also welcome any amendments to the corpus by other researchers.
Our work addresses the shortage of relevant data for Arabic natural
language processing, bearing in mind how few Arab participants
contribute resources that are important for researchers working on Arabic NLP.
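
One simple way to use the annotations as a reference is token-level accuracy. The sketch below compares the tags produced by another tool with the tags in the corpus. It assumes both sequences are already aligned token by token, and `token_accuracy` is a hypothetical helper name. Since the corpus annotations are themselves system output, the figure measures agreement with KALIMAT rather than correctness against manual annotation.

```python
def token_accuracy(reference, predicted):
    """Fraction of positions where the predicted tag equals the reference tag."""
    if len(reference) != len(predicted):
        raise ValueError("tag sequences must be aligned token by token")
    if not reference:
        return 0.0
    matches = sum(r == p for r, p in zip(reference, predicted))
    return matches / len(reference)
```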

KALIMAT uses copyright material. Details of the terms of the
applicable copyrights are described in the file COPYRIGHT that
accompanies this resource. The source of the documents is the
Omani newspaper Alwatan, http://www.alwatan.com.

KALIMAT was created by
1-Mahmoud El-Haj m.el-haj@lancaster.ac.uk
http://www.lancs.ac.uk/staff/elhaj/
And
2-Rim Koulali rim.koulali@gmail.com

1-School of Computing and Communications, Lancaster University, Lancaster, Lancashire, UK.
2-LARI Laboratory, Mohammed 1 University, Oujda, Morocco.

References

Abbas, M., Smaili, K. and Berkani, D. 2011. “Evaluation of Topic Identification
Methods on Arabic Corpora”. Journal of Digital Information Management, vol.
9, no. 5, pp. 185-192.

Al-Sulaiti, L. and Atwell, E.S. 2006. “The design of a corpus of
Contemporary Arabic”. International Journal of Corpus Linguistics, 11(2):
135–171.

Benajiba, Y., Rosso, P. and Benedí Ruiz, J. 2007. “ANERsys: An Arabic Named Entity
Recognition System Based on Maximum Entropy”. Computational Linguistics and
Intelligent Text Processing, 143–153.

El-Haj, M., Kruschwitz, U. and Fox, C. 2010. “Using Mechanical Turk to Create a
Corpus of Arabic Summaries”. In The 7th International Language Resources and
Evaluation Conference (LREC 2010), pages 36–39, Valletta, Malta. LREC.

El-Haj, M., Kruschwitz, U. and Fox, C. 2011. “Exploring Clustering for
Multi-Document Arabic Summarisation”. In The 7th Asian Information Retrieval
Societies (AIRS 2011), volume 7097 of Lecture Notes in Computer Science, pages
550–561. Springer Berlin / Heidelberg.

Koulali, R. and Meziane, A. 2012. “A contribution to Arabic Named Entity
Recognition”. In ICT and Knowledge Engineering. ICT Knowledge Engineering,
pages 46–52.

Mazroui, A., Meziane, A., Ould Abdallahi Ould Bebah, M., Boudlal, A., Lakhouaja,
A. and Shoul, M. 2011. “Alkhalil MorphoSys: Morphosyntactic analysis system for
non vocalized Arabic”. In Proceedings of the 7th International Computing Conference
in Arabic.

Salton, G., Wong, A. and Yang, S. 1975. “A Vector Space Model for Automatic
Indexing”. Communications of the ACM, 18(11):613–620.

Toutanova, K., Klein, D., Manning, C.D. and Singer, Y. 2003. “Feature-Rich
Part-Of-Speech Tagging With a Cyclic Dependency Network”. In Proceedings
of the 2003 Conference of the North American Chapter of the Association for
Computational Linguistics on Human Language Technology - Volume 1, NAACL ’03,
pages 173–180.