#INSTALL
------------------
Arabic Stop words
--------------------
- This list can be reused.
It is not easy to determine the stop words and, on the other hand, stop words differ from case to case.
For this purpose, we propose a classified list
that can be parameterized by the developer.
The word list contains only words in their common forms,
and we have generated all forms with a script.


Files
------
data/ : contains the stop word data
data/classified/stopwords.csv: the data file in CSV format
data/classified/stopwords.xls: the data in Excel format, with more valuable information and classified stop words
data/allforms/stopwordsallforms.sql: all-forms database in SQL format
data/allforms/stopwords_allforms.txt: data generated from the minimal data file
data/allforms/stopwordsallforms.py: all-forms data as a Python dictionary
tools/: scripts used to generate all forms from the minimal data
		usage :
			generate_stopwords_forms.py -f data/stopwords.csv  > output_file.txt
		Note: to make the program skip some entries, comment out their lines with # in the data file
		Note: the script can be customized

Data Structure
--------------
All forms data .CSV file
	1st field : unvocalised word (في)
	2nd field : unvocalised stemmed word, with '-' between affixes: e.g. ف-ب-خمسين-ي
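
A minimal sketch of how the all-forms data could be loaded and used to filter tokens. It assumes one entry per line with the two fields separated by a tab; the delimiter is an assumption, so adjust `delimiter` to match the actual file. The name `load_all_forms` is illustrative and not part of the package, and the sample word is simply the example segmentation above with the '-' marks removed.

```python
def load_all_forms(lines, delimiter="\t"):
    """Map each unvocalised word to the segments of its stemmed form."""
    forms = {}
    for line in lines:
        line = line.strip()
        # Skip blank lines and lines commented out with '#'.
        if not line or line.startswith("#"):
            continue
        fields = line.split(delimiter)
        if len(fields) < 2:
            continue
        forms[fields[0].strip()] = fields[1].strip().split("-")
    return forms


# Illustrative entry built from the segmented example in this readme.
segmented = "ف-ب-خمسين-ي"
word = segmented.replace("-", "")
forms = load_all_forms([word + "\t" + segmented + "\n"])

# Drop every token that appears in the all-forms list.
tokens = [word, "كتاب"]
kept = [t for t in tokens if t not in forms]
```

With real data, `lines` would be an open file object for the all-forms file instead of the one-line sample.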

	
Minimal classified data .CSV file
	1st field : unvocalised word (في)
	2nd field : type of the word: e.g. حرف
	3rd field : class of the word: e.g. preposition
	Affixation information in the other fields:
		4th field : the Arabic letter AIN if the word accepts conjunctions 'العطف', '*' otherwise
		5th field : the Arabic letter TEH if the word accepts the definite article 'ال التعريف', '*' otherwise
		6th field : the Arabic letter JEEM if the word accepts attached prepositions 'حروف الجر المتصلة', '*' otherwise
		7th field : the Arabic letter DAD if the word accepts IDAFA attached pronouns 'الضمائر المتصلة', '*' otherwise
		8th field : the Arabic letter SAD if the word accepts verb conjugation affixes 'التصريف', '*' otherwise
		9th field : the Arabic letter LAM if the word accepts LAM QASAM 'لام القسم', '*' otherwise
		10th field : the Arabic letter MEEM if the word already has ALEF LAM as its definite article 'معرف', '*' otherwise
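
A rough sketch of how one row of the minimal classified file might be read into a record with boolean affixation flags. It assumes tab-separated fields, and that the seven affixation flags follow the first three fields in the order listed above (AIN, TEH, JEEM, DAD, SAD, LAM, MEEM). The delimiter, the flag names, and the sample row are all illustrative assumptions, not taken from the data file.

```python
# Hypothetical names for the seven affixation flags, in the listed order.
FLAG_NAMES = [
    "conjunction",       # AIN
    "definite_article",  # TEH
    "preposition",       # JEEM
    "idafa_pronoun",     # DAD
    "conjugation",       # SAD
    "lam_qasam",         # LAM
    "is_defined",        # MEEM
]


def parse_classified_row(line, delimiter="\t"):
    """Turn one data line into a dict; '*' (or empty) means 'not accepted'."""
    fields = [f.strip() for f in line.rstrip("\n").split(delimiter)]
    if len(fields) < 3 + len(FLAG_NAMES):
        raise ValueError("expected at least %d fields" % (3 + len(FLAG_NAMES)))
    record = {"word": fields[0], "type": fields[1], "class": fields[2]}
    for name, value in zip(FLAG_NAMES, fields[3:]):
        record[name] = value not in ("", "*")
    return record


# Made-up sample row: the flag values are for illustration only.
row = "\t".join(["في", "حرف", "preposition", "ع", "*", "*", "ض", "*", "*", "*"])
record = parse_classified_row(row)
```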

How to customize the stop word list
-----------------------------------
1- check the minimal form data file (stopwords.csv)
2- comment out with "#" all the words you don't need
3- run the generate_stopwords_forms.py script
4- capture the output of the script.
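
The steps above could be scripted roughly as follows. This is only a sketch: it assumes tab-separated fields with the word in the first field, and that the script is started with an interpreter called `python`. The helper names `comment_out` and `build_command` are hypothetical; the -f and -o options are the ones listed in the script's usage message.

```python
def comment_out(lines, unwanted, delimiter="\t"):
    """Prefix '#' to every line whose first field is an unwanted word."""
    result = []
    for line in lines:
        first_field = line.split(delimiter, 1)[0].strip()
        if first_field in unwanted:
            line = "#" + line
        result.append(line)
    return result


def build_command(data_file, out_format="csv"):
    """Build the argument list for the generation script."""
    return ["python", "generate_stopwords_forms.py",
            "-f", data_file, "-o", out_format]


# Illustrative two-line data file; only the first two fields are shown.
lines = ["في\tحرف\n", "من\tحرف\n"]
edited = comment_out(lines, {"من"})
command = build_command("data/stopwords.csv")
```

The command list can then be passed to `subprocess.run(command, capture_output=True, text=True)` to capture the generated forms from standard output.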

Generation script usage:
------------------------
Usage: generate_stopwords_forms -f filename [OPTIONS]
	[-h | --help]		outputs this usage message
	[-V | --version]	program version
	[-f | --file= filename]	input file to generate_stopwords_forms
	[-o | --out= output format]	output format(csv,python,sql)


How to add a word to the word list
----------------------------------
1- check that the word doesn't already exist in the minimal form data file (stopwords.csv)
2- add the word with its affixation information
3- run the generate_stopwords_forms.py script
4- capture the output of the script.
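
Steps 1 and 2 might look like the sketch below. It assumes tab-separated fields and seven affixation flags after the word, type, and class fields; the function name `add_word` and the sample values are hypothetical. A word that is present but commented out with "#" is treated as already existing.

```python
def add_word(lines, word, word_type, word_class, flags, delimiter="\t"):
    """Return the data lines with a new row appended, unless the word exists."""
    if len(flags) != 7:
        raise ValueError("expected 7 affixation flags")
    # Collect the first field of every non-blank line, ignoring a leading '#'.
    existing = {
        line.lstrip("#").split(delimiter, 1)[0].strip()
        for line in lines
        if line.strip()
    }
    if word in existing:
        return list(lines)
    new_row = delimiter.join([word, word_type, word_class, *flags]) + "\n"
    return list(lines) + [new_row]


# Illustrative data: one existing row and one new word with no affixes allowed.
data = ["في\tحرف\tpreposition\t*\t*\t*\t*\t*\t*\t*\n"]
unchanged = add_word(data, "في", "حرف", "preposition", ["*"] * 7)
extended = add_word(data, "xyz", "t", "c", ["*"] * 7)
```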

Thanks
 
Source: readme, updated 2010-12-04