OSMAN Readability

Open Source tool for Arabic text readability

View the Project on GitHub drelhaj/OsmanReadability

OSMAN Readability Metric.

OSMAN (Open Source Metric for Measuring Arabic Narratives) is a novel open-source Arabic readability metric and tool. It lets researchers calculate the readability of Arabic text using our new OSMAN readability score.

The tool was published as a full paper at the LREC 2016 conference in Slovenia. [El-Haj, M., and Rayson, P. "OSMAN - A Novel Arabic Readability Metric". 10th edition of the Language Resources and Evaluation Conference (LREC'16). May 2016. Portorož, Slovenia.] http://www.lancaster.ac.uk/staff/elhaj/docs/elhajlrec2016Arabic.pdf

To access the code, and for more details on how to download and run the tool, please visit our OSMAN GitHub repository by clicking the View on GitHub button above or via the link below. OSMAN Readability GitHub Repository: https://github.com/drelhaj/OsmanReadability

OSMAN Arabic Readability Formula

The formula calculates readability for Arabic text with and without diacritics (Tashkeel). The tool presents a novel approach to counting syllables in Arabic, which has been a difficult task for many years. The tool provides accurate results for text with diacritics. Because the majority of Arabic text available online today is written without diacritics, we give the user the option to use Mishkal (sourceforge.net/mishkal/), a free online tool that adds diacritics back into Arabic text with an accuracy of over 85%.
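As a rough illustration of the choice described above, the sketch below estimates how densely a text is marked with Tashkeel so that a caller can decide whether a diacritization step is worth running first. It is not part of the OSMAN tool: the function names and the 0.3 threshold are assumptions made for this example, and it does not call Mishkal.

```python
# Tashkeel marks: U+064B (fathatan) through U+0652 (sukun).
TASHKEEL = {chr(cp) for cp in range(0x064B, 0x0653)}
TATWEEL = "\u0640"  # elongation character, not a letter


def diacritic_ratio(text: str) -> float:
    """Return the number of Tashkeel marks per Arabic letter in `text`."""
    letters = sum(
        1 for ch in text if "\u0621" <= ch <= "\u064A" and ch != TATWEEL
    )
    marks = sum(1 for ch in text if ch in TASHKEEL)
    return marks / letters if letters else 0.0


def needs_diacritization(text: str, threshold: float = 0.3) -> bool:
    """Hypothetical rule: treat sparsely marked text as undiacritized."""
    return diacritic_ratio(text) < threshold


plain = "\u0643\u062A\u0628"                      # the word k-t-b, no diacritics
marked = "\u0643\u064E\u062A\u064E\u0628\u064E"   # the same word with a fatha on each letter
print(needs_diacritization(plain), needs_diacritization(marked))
```

The threshold is arbitrary; real text is often partially diacritized, so a suitable cut-off would need to be tuned on the data at hand.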

Arabic Syllables!

In our tool we count the two main types of Arabic syllables, short and long, in addition to stressed syllables. A short syllable is simply a single consonant followed by a single short vowel (e.g. “كَتَبَ” [ka-ta-ba], “he wrote”). A long syllable is usually a consonant plus a long vowel (e.g. “كِتَاب” [ki-taab], “book”, which consists of a short syllable followed by a long one). Stressed syllables are those written as doubled letters, indicating a double consonant with no vowel in between (e.g. “شَدَّدَ” [shaDDaDa], “he stressed”).
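The three syllable types described above can be approximated from the diacritics alone. The following is a simplified sketch, not the counting code used by OSMAN: it assumes fully diacritized input, treats a short vowel followed by its matching unmarked lengthening letter (alef, waw or yaa) as a long syllable, counts each shadda as one stressed syllable, and ignores tanween. The function name `count_syllables` is hypothetical.

```python
FATHA, DAMMA, KASRA, SHADDA = "\u064E", "\u064F", "\u0650", "\u0651"
# Each short vowel mapped to the letter that lengthens it (alef, waw, yaa).
LENGTHENER = {FATHA: "\u0627", DAMMA: "\u0648", KASRA: "\u064A"}
TASHKEEL = {chr(cp) for cp in range(0x064B, 0x0653)}


def count_syllables(word: str) -> dict:
    """Count short, long and stressed syllables in a diacritized word."""
    counts = {"short": 0, "long": 0, "stressed": 0}
    for i, ch in enumerate(word):
        if ch == SHADDA:
            counts["stressed"] += 1
        elif ch in LENGTHENER:
            # Look past a shadda that may follow the vowel mark.
            j = i + 1
            if j < len(word) and word[j] == SHADDA:
                j += 1
            following = word[j] if j < len(word) else ""
            after = word[j + 1] if j + 1 < len(word) else ""
            # A lengthening letter that carries its own mark acts as a consonant.
            if following == LENGTHENER[ch] and after not in TASHKEEL:
                counts["long"] += 1
            else:
                counts["short"] += 1
    return counts


kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"          # ka-ta-ba
kitaab = "\u0643\u0650\u062A\u064E\u0627\u0628"          # ki-taab
shaddada = "\u0634\u064E\u062F\u0651\u064E\u062F\u064E"  # shaDDaDa
for w in (kataba, kitaab, shaddada):
    print(count_syllables(w))
```

For the three example words this yields three short syllables, one short plus one long syllable, and three short syllables plus one stressed syllable, respectively. How the counts are combined into a score is outside the scope of this sketch.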

Dataset

We used 73,000 parallel English and Arabic paragraphs from the United Nations (UN) corpus (uncorpora.org/), a collection of resolutions of the General Assembly from Volume I of GA regular sessions 55-62 (Rafalovitch and Dale, 2009). The UN Arabic text is written without diacritics, so we used Mishkal to add them. Each language has around 3 million words from more than 2,000 documents, with each document containing 36 paragraphs on average.

Ref: A. Rafalovitch and R. Dale. 2009. United Nations General Assembly Resolutions: A Six-Language Parallel Corpus. In Proceedings of the MT Summit XII, pages 292–299. International Association for Machine Translation, August.