Download the doctoral thesis in English translation studies, "Domain Adaptation for Translation Models in Statistical Machine Translation", which comprises 147 pages and is outlined as follows:
File type: PDF
Abstract
We investigate methods to adapt translation models in statistical machine translation (SMT) to a specific target domain.
We discuss two major problems: unknown words caused by data sparseness in the (in-domain)
training data, and ambiguities arising from out-of-domain parallel texts with different
domain-specific translations. We propose novel solutions to both problems.
The main contributions of this thesis are as follows:
We present a novel translation model architecture that supports domain adaptation at
decoding time from a vector of component models. The combination is implemented
through instance weighting, and all statistics necessary for the computation of translation
probabilities are stored in the models.
We present an architecture to combine multiple MT systems, using techniques and
ideas from domain adaptation. The hypotheses of external MT systems are treated
as out-of-domain knowledge and combined with in-domain data through instance
weighting.
We introduce a sentence alignment algorithm that is able to robustly align even noisy
parallel texts. We found that higher-quality sentence alignment of the in-domain parallel
text has a significant effect on translation quality in our target domain.
We propose new translation model features that express how flexible, or general, translation
units are, in order to prevent translations that only occur in the context of multiword
expressions from being overgeneralised.
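The instance-weighting combination named in the first contribution can be illustrated with a minimal sketch. It assumes that each component model stores raw phrase-pair counts and source-phrase counts, and that the combined translation probability is the ratio of the weighted counts, p(t|s) = Σ_i w_i c_i(s,t) / Σ_i w_i c_i(s). The function name, the toy counts, and the weights are hypothetical illustrations and are not taken from the thesis.

```python
from collections import Counter


def combine_instance_weighted(models, weights, source, target):
    """Estimate p(target | source) from weighted counts of component models.

    Assumption (not from the thesis text): each model is a pair
    (pair_counts, source_counts) of Counters holding raw counts.
    """
    if len(models) != len(weights):
        raise ValueError("need exactly one weight per component model")
    # Weighted co-occurrence count of the phrase pair across all models.
    numerator = sum(w * pair_counts[(source, target)]
                    for w, (pair_counts, _) in zip(weights, models))
    # Weighted count of the source phrase across all models.
    denominator = sum(w * source_counts[source]
                      for w, (_, source_counts) in zip(weights, models))
    return numerator / denominator if denominator > 0 else 0.0


# Hypothetical toy counts: a small in-domain model and a larger
# out-of-domain model that prefers a different translation.
in_domain = (
    Counter({("Pass", "mountain pass"): 8, ("Pass", "passport"): 2}),
    Counter({"Pass": 10}),
)
out_of_domain = (
    Counter({("Pass", "mountain pass"): 10, ("Pass", "passport"): 90}),
    Counter({"Pass": 100}),
)
models = [in_domain, out_of_domain]

# Equal weights let the larger out-of-domain model dominate.
p_equal = combine_instance_weighted(models, [1.0, 1.0], "Pass", "mountain pass")
# Up-weighting the in-domain model shifts probability mass towards
# the domain-specific translation.
p_adapted = combine_instance_weighted(models, [10.0, 1.0], "Pass", "mountain pass")
```

Because the weights are applied to stored counts rather than to finished probabilities, the weight vector can be changed at decoding time without retraining the component models, which is in line with the statement that all necessary statistics are stored in the models.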
Zusammenfassung (German abstract)
We investigate methods for adapting translation models in SMT to a specific
target domain. We discuss two main problems: sparse data in the training data
of the target domain leads to unknown words, and drawing on data
from other domains causes ambiguities. We present new approaches to both
problems.
The main contributions of this dissertation are as follows:
We present an architecture for translation models that consists of a vector
of component models and permits domain adaptation during translation
itself. The component models are combined through a weighting of occurrence
frequencies.
We present an architecture for combining different translation systems
by means of techniques from domain adaptation. The hypotheses of external translation
systems are treated like out-of-domain knowledge and combined with
data from the target domain.
We present a sentence alignment method that can robustly align even noisy parallel
texts at the sentence level. By raising sentence alignment quality,
we achieve significantly better translation quality.
We propose new features for translation models that express the flexibility
of translation units and prevent inflexible translations,
which occur only within a multiword expression, from being
overgeneralised.
Contents
1 Introduction
1.1 Problem: Domain-specific Statistical Machine Translation
1.2 Thesis Contributions
1.3 Outline
2 Statistical Machine Translation
2.1 Statistical Models for Machine Translation
2.1.1 Word-based SMT
2.1.2 Log-Linear Models
2.2 Phrase-based Translation Models
2.2.1 Learning Phrase Translations
2.3 Discriminative Training
2.4 SMT Evaluation
2.4.1 BLEU and METEOR
2.4.2 Randomness and Statistical Significance
2.5 Alternative Translation Models
2.5.1 Hierarchical and Syntax-based Translation Models
2.5.2 N-Gram Translation Models
2.5.3 Continuous Space Translation Models
2.6 Domain Adaptation in SMT
2.6.1 Language Model Adaptation
2.6.2 Translation Model Adaptation
3 Domain-specific Language
3.1 The Text+Berg Corpus
3.2 Europarl
3.3 Linguistic Differences between Text+Berg and Europarl
4 Building a Domain-specific SMT system
4.1 Experimental Data and Model Configurations
4.1.1 Corpora
4.1.2 Tools and Models
4.2 SMT Learning Curves: How Important is In-domain Data?
4.3 Summary
5 Improving Data Collection: Sentence Alignment
5.1 Related Work
5.2 MT-based Sentence Alignment
5.3 Bleualign: Algorithm
5.3.1 Weighting Sentence Pairs
5.3.2 Dynamic Programming Search
5.3.3 Additional Alignment Procedures
5.4 Evaluation of Sentence Alignment
5.5 On the Relation Between Sentence Alignment Quality and SMT Performance
5.6 Summary
6 Translation Model Combination: Tackling the Ambiguity Problem
6.1 Discussion of Domain Adaptation Techniques
6.1.1 Log-linear Interpolation
6.1.2 Linear Interpolation
6.1.3 Instance Weighting
6.1.4 Data Selection
6.1.5 Priority Merge
6.1.6 Origin Features
6.2 Perplexity
6.2.1 Theoretical Background
6.2.2 Translation Model Perplexity
6.2.3 Perplexity Minimization
6.3 Evaluation of Domain Adaptation Techniques
6.3.1 Data and Methods
6.3.2 Results
6.4 The Impact of Weights
6.5 Domain Adaptation with Unsupervised Clustering of Training Data
6.5.1 Clustering with Exponential Smoothing
6.5.2 Model Combination
6.5.3 Evaluation
6.6 A Multi-Domain Translation Model Architecture
6.7 Summary
7 Integrating Other Knowledge Sources: Multi-Engine Machine Translation
7.1 Related Work
7.2 A Multi-Engine MT Architecture
7.3 Translation Model Combination
7.4 Evaluation of Multi-Engine MT
7.4.1 On the Use of Perplexity for Machine-Translated Text
7.4.2 Combining Out-of-domain Data and Translation Hypotheses
7.5 Summary
8 Multiword Expressions and Flexibility Features
8.1 Introduction
8.2 Related Work
8.3 Learning Translations in SMT
8.4 Flexibility Features
8.4.1 Variants for Hierarchical Phrase-based Models
8.5 Filtering Hierarchical Rule Tables
8.6 Evaluation of Flexibility Scores
8.6.1 Data and Methods
8.6.2 Phrase-based Results
8.6.3 Hierarchical Results
8.7 Summary
9 Conclusion and Outlook
Bibliography
10 Appendix