USE OF STATISTICAL MACHINE TRANSLATION IN TEXTUAL TRANSLATION
The aim of this paper was to explore the possibility of obtaining good SMT performance by approaching the problem from two main points of view: 1) using very small training sets rather than huge quantities of (mostly) out-of-domain data, and 2) characterising parallel data in terms of their text varieties (above all, domain), in order to better understand which documents are most suitable as training data for specific translation tasks. Limiting the quantity of training data when building SMT systems offers several advantages: fewer computational resources are required than with larger datasets, there is little or no loss in translation performance, and in some cases results even improve. Discriminating between documents belonging to different textual varieties has been explored before, but the present paper addresses these two aspects further, in particular by using even smaller quantities of data and by borrowing techniques for analysing textual data from genre/domain studies. These techniques were also used to choose a suitable parallel corpus for the final sub-sampling experiments, which subsequently led to the decision to create a new parallel corpus from the web. To this end, a pipeline for collecting parallel corpora from the web was set up (based on previous, but mostly no longer available, attempts), and the current state of the web as a 'multilingual corpus' was analysed as well.