With the increasing demand for access to content in foreign languages in recent years, we have also seen a steady improvement in the quality of tools that can help bridge this gap. One such tool is Statistical Machine Translation (SMT), which learns automatically from real examples of human translations, without the need for manual intervention. Training such a system takes just a few days, sometimes even hours, but requires a lot of sentences aligned to their corresponding translations, a resource known as a bi-text.

Such bi-texts contain translations of written texts as they are typically derived from newswire, administrative, technical and legislation documents, e.g., from the EU and UN. However, with the widespread use of mobile phones and online conversation programs such as Skype as well as personal assistants such as Siri, there is a growing need for spoken language recognition, understanding, and translation. Unfortunately, most bi-texts are not very useful for training a spoken language SMT system as the language they cover is written, which differs from speech in style, formality, vocabulary choice, length of utterances, etc.

It turns out that there exists a growing community-generated source of spoken language translations, namely movie subtitles. These come as plain text in a common format that facilitates rendering the text segments on screen. The dark side of subtitles is that they are usually created for pirated copies of copyright-protected movies. Yet, their use in research is an exploitation of a “positive side effect” of Internet movie piracy, which allows for easy creation of spoken bi-texts in a number of languages. Aligning subtitles across languages typically relies on a key property of movie subtitles, namely the temporal indexing of subtitle segments, along with other features.

Due to the nature of movies, subtitles differ from other resources in several aspects: they are mostly transcriptions of movie dialogues that are often spontaneous speech, which contains a lot of slang, idiomatic expressions, and also fragmented spoken utterances, with repetitions, errors and corrections, rather than grammatical sentences; thus, this material is commonly summarised in the subtitles, rather than being literally transcribed. Since subtitles are user-generated, the translations are free, incomplete and dense (due to summarization and compression) and, therefore, reveal cultural differences. Degrees of rephrasing and compression vary across languages and also depend on subtitling traditions. Moreover, subtitles are created to be displayed in parallel to a movie in order to be linked to the movie's actual sound signal. Subtitles also arbitrarily include some meta information such as the movie title, year of release, genre, subtitle author/translator details and trailers. They may also contain visual translation, e.g., into a sign language. Certain versions of subtitles are especially compiled for the hearing-impaired to include extra information about non-spoken sounds that are either primary, e.g., coughing, or secondary background noises, e.g., soundtrack music, street noise, etc. This brings yet another challenge to the alignment process: the complex mappings caused by many deletions and insertions. Furthermore, subtitles must be short enough to fit the screen in a readable manner and are only shown for a short time period, which presents a new constraint to the alignment of different languages with different visual and linguistic features.

The languages in which subtitles are available differ from one movie to another. Arabic, even though it is spoken by more than 420 million people and is the 5th most spoken language worldwide, has a relatively scarce online presence. For example, according to Wikipedia's statistics of article counts, Arabic is ranked 23rd. Yet, Web traffic analytics shows that search queries for Arabic subtitles and traffic from the Arabic region are among the highest. This increase in demand for Arabic content is not surprising given the recent dramatic economic and socio-political shift in the Arab World. On another note, Arabic, as a Semitic language, has a complex morphology, which requires special handling when mapping it to another language and therefore poses a challenge for machine translation.

In this work, we look at movie subtitles as a unique source of bi-texts in an attempt to align as many translations of movies as possible in order to improve English to Arabic SMT. Translating from English into Arabic is an underexplored translation direction and, due to the morphological richness of Arabic along with other factors, yields significantly lower results compared to translating in the opposite direction (Arabic to English).

For our experiments, we collected pairs of English-Arabic subtitles for more than 29,000 movies/TV shows, a collection larger than any preexisting subtitle data set. We designed a sequence of heuristics to eliminate the inherent noise that comes with the subtitles' source in order to yield good quality alignment. We aligned the subtitles by time overlap, using the timing information provided within the subtitle files. This alignment approach is language-independent and outperforms other traditional approaches such as the length-based approach, which relies on segment boundaries to match translation segments; such boundaries differ from one language to another, e.g., because of the need to fit the text on the screen.
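
The idea of aligning by time overlap can be illustrated with a minimal sketch. This is not the implementation used in this work: the names `Segment`, `overlap_ratio` and `align_by_time_overlap`, the normalisation by the shorter segment, and the `min_ratio` threshold of 0.5 are all assumptions made for illustration. The sketch also assumes that each subtitle file has already been parsed into segments sorted by start time, and that segments within one file do not overlap each other.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    """One subtitle segment (hypothetical structure): times in seconds."""
    start: float
    end: float
    text: str


def overlap_ratio(a: Segment, b: Segment) -> float:
    """Shared display time divided by the duration of the shorter segment."""
    shared = min(a.end, b.end) - max(a.start, b.start)
    shorter = min(a.end - a.start, b.end - b.start)
    if shared <= 0 or shorter <= 0:
        return 0.0
    return shared / shorter


def align_by_time_overlap(
    source: List[Segment],
    target: List[Segment],
    min_ratio: float = 0.5,  # assumed threshold, not taken from the paper
) -> List[Tuple[str, str]]:
    """Pair each source segment with the target segment it overlaps most.

    Both lists must be sorted by start time. Source segments whose best
    overlap falls below min_ratio are left unaligned.
    """
    pairs = []
    j = 0
    for src in source:
        # Skip target segments that end before this source segment starts;
        # they cannot overlap any later source segment either.
        while j < len(target) and target[j].end <= src.start:
            j += 1
        best, best_ratio = None, 0.0
        k = j
        # Scan only the target segments that start before src ends.
        while k < len(target) and target[k].start < src.end:
            ratio = overlap_ratio(src, target[k])
            if ratio > best_ratio:
                best, best_ratio = target[k], ratio
            k += 1
        if best is not None and best_ratio >= min_ratio:
            pairs.append((src.text, best.text))
    return pairs
```

Because the sketch looks only at timestamps, it never inspects the text itself, which is what makes this kind of alignment language-independent. A segment present in only one file, such as a sound description for the hearing-impaired, simply finds no overlapping partner and is dropped.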

Our goal was to maximise the number of aligned sentence pairs while minimising the alignment errors. We evaluated our models relatively and also extrinsically, i.e., by measuring the quality of an SMT system that used this bi-text for training. We automatically evaluated our SMT systems using BLEU, a standard measure for machine translation evaluation. We also implemented an in-house Web application tool in order to crowd-source human judgments comparing the SMT baseline's output and our best-performing system's output.

Our experiments yielded bi-texts of varied size and relative quality, which we used to train an SMT system. Adding any of our bi-texts improved the baseline SMT system, which was trained on TED talks from the IWSLT 2013 competition. Ultimately, our best SMT system outperformed the baseline by about two BLEU points, which is a very significant improvement, clearly visible to humans; this was confirmed in manual evaluation. We hope that the resulting subtitles corpus, the largest collected so far (about 82 million words), will facilitate research in spoken language SMT.

