Word segmentation is a necessary step for natural language processing applications such as machine translation and parsing. In this research we focus on Arabic word segmentation and study its impact on Arabic-to-English translation. Accurate word segmentation systems exist for Arabic, such as MADA (Habash, 2007), but such systems usually need manually built data and rules of the Arabic language. In this work, we look at unsupervised word segmentation systems to see how well they perform on Arabic without relying on any linguistic information about the language. The methodology of this research can be applied to many other morphologically complex languages. We focus on three leading unsupervised word segmentation systems proposed in the literature: Morfessor (Creutz and Lagus, 2002), ParaMor (Monson, 2007), and Demberg's system (Demberg, 2007). We also use two different segmentation schemes of the state-of-the-art MADA and compare their precision with that of the unsupervised systems. After training the three unsupervised segmentation systems, we apply the resulting models to segment the Arabic side of the parallel data for Arabic-to-English statistical machine translation (SMT) and measure the impact on translation quality. We also segment the SMT data with the two MADA schemes and compare the results against the baseline system. The 10-fold cross-validation results indicate that the unsupervised segmentation systems are usually inaccurate, with precision below 40%, and hence do not improve SMT quality. Both segmentation schemes of MADA, in contrast, have very high precision. Of the two MADA schemes, the one with a measured segmentation framework improved translation accuracy, while the second, which performs more aggressive segmentation, failed to improve SMT quality. We also provide some rule-based supervision to correct some of the errors in our best unsupervised models.
While this framework performs better than the baseline unsupervised systems, it still does not outperform the baseline MT quality. We conclude that, in our unsupervised framework, the noise introduced by unsupervised segmentation offsets the potential gains that segmentation could provide to MT. We also conclude that a measured supervised word segmentation improves Arabic-to-English translation quality. In contrast, aggressive and exhaustive segmentation introduces new noise into the MT framework and actually harms its quality. This publication was made possible by the generous support of the Qatar Foundation through Carnegie Mellon University's Seed Research program provided to Kemal Oflazer. The statements made herein are solely the responsibility of the authors.
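
The abstract reports segmentation precision without stating how it is computed. As a rough illustration only, the sketch below assumes precision is measured over morpheme boundaries: the fraction of boundaries proposed by a system that also appear in the gold segmentation. The function names, the `+` separator, and the transliterated toy words are hypothetical and are not taken from the paper or from any of the systems it evaluates.

```python
def boundaries(segmentation, sep="+"):
    """Return the character offsets (in the unsegmented word) of morpheme boundaries."""
    offsets = set()
    pos = 0
    # Every morph except the last one is followed by a boundary.
    for morph in segmentation.split(sep)[:-1]:
        pos += len(morph)
        offsets.add(pos)
    return offsets


def boundary_precision(predicted, gold, sep="+"):
    """Fraction of predicted boundaries that also occur in the gold segmentation.

    Assumption: precision is boundary-based; the paper may define it differently.
    """
    correct = 0
    proposed = 0
    for pred_word, gold_word in zip(predicted, gold):
        pred_b = boundaries(pred_word, sep)
        gold_b = boundaries(gold_word, sep)
        proposed += len(pred_b)
        correct += len(pred_b & gold_b)
    return correct / proposed if proposed else 0.0


# Toy, made-up data (not results from the paper).
gold = ["w+ktAb+hm", "b+Al+byt"]
predicted = ["w+kt+Ab+hm", "bAl+byt"]
print(boundary_precision(predicted, gold))  # 0.75
```

In this toy case the system proposes four boundaries and three of them match the gold standard, so the spurious split inside the stem is what lowers precision. Over-segmentation of this kind is one way an unsupervised system could add noise to the translation data.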

