Trying Japanese-to-English translation with EncoderDecoder (Part 1)

I tried out various things while reading Chapter 8 of 『Chainer による実践深層学習』.

EncoderDecoder

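EncoderDecoder is a model for problems such as machine translation, where a sequence is given as input and a sequence is produced as output. It is apparently also called seq2seq.

It has the following desirable properties: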

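The length of the input sequence and the length of the output sequence may differ
- When translating from Japanese to English, the number of words in the Japanese sentence generally differs from the number of words in the English sentence.
The correspondence between input and output does not have to be obvious
- For example, Japanese and English have different word orders, so there is no simple alignment between them.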

EncoderDecoder is a neural network like the following.

[Figure: structure of the EncoderDecoder network (f:id:nojima718:20171010012354p:plain)]

The part enclosed by the green dotted line is the Encoder, and the part enclosed by the orange dotted line is the Decoder.

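The input sequence $x_1, \ldots, x_{T_x}$ is fed to the Encoder ($T_x$ is the length of the input sequence). Each word fed to the Encoder passes through an Embed layer (a linear layer that takes a word ID) and is then fed to an LSTM. Each LSTM receives the hidden state from the previous LSTM together with the current word, and outputs the next hidden state.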
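The remark that Embed is a linear layer taking a word ID can be checked with a tiny sketch: looking up row `word_id` of a weight matrix gives the same vector as multiplying the one-hot vector of that ID by the matrix. The matrix values and the function names below are made up for illustration.

```python
from typing import List


def embed(weights: List[List[float]], word_id: int) -> List[float]:
    """Embedding lookup: pick the row that belongs to the word ID."""
    return weights[word_id]


def one_hot_times_matrix(weights: List[List[float]], word_id: int) -> List[float]:
    """The same operation written as a linear layer applied to a one-hot vector."""
    vocabulary_size = len(weights)
    dimension = len(weights[0])
    one_hot = [1.0 if i == word_id else 0.0 for i in range(vocabulary_size)]
    return [
        sum(one_hot[i] * weights[i][j] for i in range(vocabulary_size))
        for j in range(dimension)
    ]


# Toy weights: a vocabulary of 3 words embedded into 2 dimensions.
toy_weights = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
print(embed(toy_weights, 1))
print(one_hot_times_matrix(toy_weights, 1))
```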

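The LSTMs on the Decoder side are trained to output the "next word". The first Decoder LSTM receives EOS (more precisely, its embedded vector) and the hidden state passed from the Encoder, and is trained to output the first word of the output sequence. The next LSTM receives that LSTM's hidden state and the first word of the output sequence, and is trained to output the second word. The last LSTM is trained to output EOS. ($W_D$ is a linear layer that maps the embedded word vector back to the original dimension.)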
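Written as equations, the description above looks roughly like the following. Only $x_t$, $T_x$ and $W_D$ come from the text. The names $E_x$, $E_y$ (the two Embed layers), $h_t$, $c_t$ (Encoder hidden and cell states), $s_t$, $d_t$ (Decoder hidden and cell states) and $y_1, \ldots, y_{T_y}$ (the output sequence) are notation assumed here, and bias terms are omitted.

$$
\begin{aligned}
(h_0, c_0) &= (0, 0) \\
(h_t, c_t) &= \mathrm{LSTM}_{\mathrm{enc}}(E_x x_t,\ h_{t-1},\ c_{t-1}), & t &= 1, \ldots, T_x \\
(s_0, d_0) &= (h_{T_x}, c_{T_x}), \qquad y_0 = y_{T_y + 1} = \mathrm{EOS} \\
(s_t, d_t) &= \mathrm{LSTM}_{\mathrm{dec}}(E_y y_{t-1},\ s_{t-1},\ d_{t-1}), & t &= 1, \ldots, T_y + 1 \\
p(y_t \mid y_{<t}, x) &= \mathrm{softmax}(W_D s_t)_{y_t} \\
\mathrm{loss} &= -\sum_{t=1}^{T_y + 1} \log p(y_t \mid y_{<t}, x)
\end{aligned}
$$

The loss is written as a sum here; averaging over time steps or over sentences only changes its scale.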

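At prediction time, feeding the input sequence to the Encoder and EOS to the first Decoder unit yields the first word of the output sequence. Feeding that word to the second Decoder unit yields the second word. Repeating this until EOS is output yields the whole output sequence.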
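The prediction procedure above can be sketched as a short loop. This is a framework-independent sketch, not the code used in this post: `step` stands in for one Decoder unit (Embed, LSTM, and $W_D$), `toy_step` is a toy stand-in that replays a fixed sequence, and `max_length` is a safeguard assumed here in case EOS never appears.

```python
from typing import Callable, List, Tuple, TypeVar

State = TypeVar("State")


def greedy_decode(
    initial_state: State,
    step: Callable[[State, int], Tuple[State, List[float]]],
    eos_id: int,
    max_length: int = 50,
) -> List[int]:
    """Generate word IDs one at a time until EOS is produced."""
    state = initial_state
    word = eos_id  # the first Decoder unit receives EOS
    output: List[int] = []
    for _ in range(max_length):
        state, scores = step(state, word)
        # Greedy choice: take the word with the highest score.
        word = max(range(len(scores)), key=scores.__getitem__)
        if word == eos_id:
            break
        output.append(word)
    return output


def toy_step(state: int, word: int) -> Tuple[int, List[float]]:
    """Toy decoder: the state is a position counter into a fixed sequence."""
    target = [3, 1, 2, 0]  # word ID 0 plays the role of EOS
    scores = [0.0] * 4
    scores[target[state]] = 1.0
    return state + 1, scores


print(greedy_decode(0, toy_step, eos_id=0))  # [3, 1, 2]
```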

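It is a mechanism that seems like a joke, one that makes you want to say "there's no way this can translate anything!!", yet frighteningly it does manage to translate to some extent.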

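Preparing the dataset

Obtaining the data

I obtained the Japanese-English parallel data from tatoeba.org. tatoeba.org distributes a large amount of parallel data under Creative Commons, which is wonderful. On top of that, the example sentences are textbook-like, which makes them very easy to use for machine learning.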

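From the site's download page I downloaded the "sentences" and "links" files. The sentences file is a list of ID, language name, and sentence; the links file is a list of link-source ID and link-destination ID. Sentences connected by a link are translations of each other.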
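Joining the two files into sentence pairs might look like the sketch below. It assumes that both files are tab-separated and that the language names are `jpn` and `eng`; neither detail is stated above, and the IDs and rows are toy data made up for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def load_sentences(lines: Iterable[str]) -> Dict[str, Tuple[str, str]]:
    """Map sentence ID -> (language, text), assuming tab-separated columns."""
    sentences = {}
    for line in lines:
        sentence_id, language, text = line.rstrip("\n").split("\t", 2)
        sentences[sentence_id] = (language, text)
    return sentences


def build_pairs(
    sentences: Dict[str, Tuple[str, str]],
    link_lines: Iterable[str],
    source_language: str = "jpn",  # assumed language name for Japanese
    target_language: str = "eng",  # assumed language name for English
) -> List[Tuple[str, str]]:
    """Collect (source text, target text) for links that go source -> target."""
    pairs = []
    for line in link_lines:
        source_id, target_id = line.rstrip("\n").split("\t")
        source = sentences.get(source_id)
        target = sentences.get(target_id)
        if source is None or target is None:
            continue  # the link points at a sentence we did not load
        if source[0] == source_language and target[0] == target_language:
            pairs.append((source[1], target[1]))
    return pairs


# Toy data with made-up IDs; the real files are far larger.
sentence_lines = [
    "1\tjpn\t今日は独立記念日です。\n",
    "2\teng\tToday is Independence Day.\n",
    "3\tfra\tBonjour.\n",
]
link_lines = ["1\t2\n", "2\t1\n", "1\t3\n"]

pairs = build_pairs(load_sentences(sentence_lines), link_lines)
print(pairs)
```

Filtering on the link direction (Japanese source, English target) keeps a pair from being collected twice when the links file lists it in both directions.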

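Formatting the data

To make the data easy to use for machine learning, I first built a list of Japanese-English sentence pairs. This gave 196,443 pairs.

Next, I split each sentence into a sequence of words.

I used MeCab to split the Japanese sentences, with the plain ipadic dictionary. The Japanese sentences averaged 11.5 words.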

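English sentences could be split crudely with split or re, but handling irregular quotes and periods, as in can't or U.S., was a hassle, so I used the TokenizerME subcommand of OpenNLP. For the model passed to TokenizerME I used en-token.bin from http://opennlp.sourceforge.net/models-1.5/ . The English sentences averaged 9.2 words.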
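To illustrate why crude splitting is awkward, here is a small sketch using only the Python standard library. The sample sentence and the regular expression are assumptions made for illustration, not something used in the experiment. The tokenized sentences in the results further down contain forms such as `ca n't` and `'s`, which neither approach below produces.

```python
import re

sentence = "I can't go to the U.S. today."

# Whitespace splitting leaves the sentence-final period glued to "today".
by_split = sentence.split()

# A crude regex that separates every punctuation mark breaks "can't" and "U.S." apart.
by_regex = re.findall(r"\w+|[^\w\s]", sentence)

print(by_split)
print(by_regex)
```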

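Splitting into a training set and a test set

I assigned 80% of the pairs in the dataset (157,154 pairs) to the training set. From the remaining 20%, I removed the 5,509 pairs whose Japanese sentence also appears in the training data, and used the remaining 33,780 pairs as the test set.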
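The split can be sketched as follows. Shuffling before the split, the fixed seed, and the function name are assumptions; the text above only states the 80/20 ratio and the removal of pairs whose Japanese sentence already appears in the training data.

```python
import random
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]  # (Japanese sentence, English sentence)


def split_dataset(
    pairs: Sequence[Pair], train_ratio: float = 0.8, seed: int = 0
) -> Tuple[List[Pair], List[Pair]]:
    """Split into train/test and drop test pairs whose source is in train."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # assumption: shuffle before splitting

    n_train = int(len(shuffled) * train_ratio)
    train = shuffled[:n_train]
    held_out = shuffled[n_train:]

    # Remove held-out pairs whose Japanese sentence the model has already seen.
    train_sources = {ja for ja, _ in train}
    test = [(ja, en) for ja, en in held_out if ja not in train_sources]
    return train, test


# Toy data: 20 pairs in which Japanese sentences repeat.
toy_pairs = [("ja{}".format(i % 7), "en{}".format(i)) for i in range(20)]
train_set, test_set = split_dataset(toy_pairs)
print(len(train_set), len(test_set))
```

With the real data, `int(196443 * 0.8)` gives the 157,154 training pairs mentioned above.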

Implementing the model

Chainer provides a Link called NStepLSTM, which represents a network of chained LSTMs, so the model can be written simply as follows.

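from typing import List

import numpy as np
import chainer.functions as F
import chainer.links as L
from chainer import Chain, Variable


class EncoderDecoder(Chain):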
    def __init__(self, input_dimension: int, output_dimension: int, hidden_dimension: int):
        super().__init__()

        with self.init_scope():
            self._embed_input = L.EmbedID(input_dimension, hidden_dimension)
            self._embed_output = L.EmbedID(output_dimension, hidden_dimension)

            self._encoder = L.NStepLSTM(
                n_layers=1,
                in_size=hidden_dimension,
                out_size=hidden_dimension,
                dropout=0.1)
            self._decoder = L.NStepLSTM(
                n_layers=1,
                in_size=hidden_dimension,
                out_size=hidden_dimension,
                dropout=0.1)

            # I would like a good name for the matrix that does the inverse of Embed.
            self._extract_output = L.Linear(hidden_dimension, output_dimension)

    def __call__(self, xs: List[Variable], ys: List[Variable]) -> Variable:
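        batch_size = len(xs)
        # Feed each input sentence to the encoder in reverse word order.
        xs = [x[::-1] for x in xs]

        # EOS is the word ID of the end-of-sentence token (defined elsewhere).
        # The decoder input starts with EOS; the target ends with EOS.
        eos = np.array([EOS], dtype=np.int32)
        ys_in = [F.concat((eos, y), axis=0) for y in ys]
        ys_out = [F.concat((y, eos), axis=0) for y in ys]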

        embedded_xs = [self._embed_input(x) for x in xs]
        embedded_ys = [self._embed_output(y) for y in ys_in]

        # The encoder's final hidden and cell states initialize the decoder.
        hidden_states, cell_states, _ = self._encoder(None, None, embedded_xs)
        _, _, embedded_outputs = self._decoder(hidden_states, cell_states, embedded_ys)

        # Add up the cross entropy of each sentence, then average over the mini-batch.
        loss = 0
        for embedded_output, y in zip(embedded_outputs, ys_out):
            output = self._extract_output(embedded_output)
            loss += F.softmax_cross_entropy(output, y)
        loss /= batch_size

        return loss

__call__ implements the training procedure described at length above. To support mini-batches, the arguments xs and ys take a List[Variable] rather than a single Variable.
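How `__call__` prepares the decoder's inputs and targets can be seen in isolation with plain Python lists. This is a minimal sketch; the EOS value of 0, the word IDs, and the helper name are made up for illustration.

```python
from typing import List, Tuple

EOS = 0  # hypothetical word ID for the end-of-sentence token


def make_decoder_io(y: List[int]) -> Tuple[List[int], List[int]]:
    """Return (decoder input, decoder target) for one output sentence."""
    y_in = [EOS] + y   # what the decoder reads at each step
    y_out = y + [EOS]  # what the decoder should emit at each step
    return y_in, y_out


y_in, y_out = make_decoder_io([5, 8, 3])
print(y_in)   # [0, 5, 8, 3]
print(y_out)  # [5, 8, 3, 0]
```

At every position `t`, the target `y_out[t]` is the word that follows the input `y_in[t]`, which is the "next word" training described earlier.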

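Training

Training is the usual loop over mini-batches. DataSet is a type I defined myself; it is an object that holds the Japanese-English parallel data together with the vocabularies. So that the model can be saved during training, the function yields the model at the end of each epoch.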

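import logging

from chainer import optimizers

# np, Variable, and EncoderDecoder come from the model code above.
logger = logging.getLogger(__name__)


def train(dataset: DataSet, n_epoch: int = 20):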
    model = EncoderDecoder(
        input_dimension=dataset.ja_vocabulary.size,
        output_dimension=dataset.en_vocabulary.size,
        hidden_dimension=512)

    optimizer = optimizers.Adam()
    optimizer.setup(model)

    batch_size = 128

    for epoch in range(n_epoch):
        shuffled = np.random.permutation(dataset.n_sentences)

        for i in range(0, dataset.n_sentences, batch_size):
            logger.info("Epoch {}: Mini-Batch {}".format(epoch, i))

            indices = shuffled[i:i+batch_size]
            xs = [Variable(dataset.ja_sentences[j]) for j in indices]
            ys = [Variable(dataset.en_sentences[j]) for j in indices]

            model.cleargrads()
            loss = model(xs, ys)
            loss.backward()
            optimizer.update()

        # Hand the model back after every epoch so that the caller can save it.
        yield model, epoch

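Experimental results

I set the hidden layer dimension to 512 and trained for 20 epochs on the training set built above. Feeding 20 arbitrarily chosen test-set sentences to the resulting model gave the following output. (JA and EN are the example sentences in the test set, and Output is the model's output.)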

    JA: 昼食 会 に １ ０ 人 を 招待 し た 。
    EN: We asked ten people to the luncheon .
Output: A one invited ten people to the meeting .

    JA: 彼女 の 言葉づかい に は 誤り が 多い 。
    EN: Her grammar is bad .
Output: Her heart 's beating wildly .

    JA: その 文書 に は その 戦い が １ ７ ０ ０ 年 に 起こっ た と 記録 さ れ て いる 。
    EN: The document records that the war broke out in 1700 .
Output: The document was $ with 500 years when the battle was due to the stock .

    JA: はるか その 島 が 見え ます 。
    EN: You can see the island in the distance .
Output: You can see the picture on the island .

    JA: 昨日 は 一 日 中 英単語 を 暗記 し た 。
    EN: I learned English words by heart all day yesterday .
Output: The whole day passed the day off yesterday and broke out .

    JA: 第 二 の 問題 を 取り上げ ましょ う 。
    EN: Let 's take up the second problem , shall we ?
Output: Let 's discuss the problem with two .

    JA: トム は 空港 へ 向かう 途中 だ 。
    EN: Tom is on his way to the airport .
Output: Tom 's going to the airport in a minute .

    JA: 私 たち は それ を 実行 不可能 と おもっ た こと が ない 。
    EN: We never thought of it as impossible to carry out .
Output: We have never thought of it .

    JA: あと で わかっ た こと だ が 、 彼 は 親切 な 男 だっ た 。
    EN: He was a kind man , as I later discovered .
Output: He had been alone , I knew to be a serious man .

    JA: 人々 は より 多く の 自由 と 平等 を 求める 。
    EN: People pursue more freedom and equality .
Output: People deal for freedom as handsome as freedom .

    JA: 私 は 決して あなた 失望 さ せ ませ ん 。
    EN: I 'll never let you down .
Output: I 'll never let you down .

    JA: 彼 は とっつき やすい 人 だ 。
    EN: He is a friendly person .
Output: He is easy to get a big shot .

    JA: その 少女 は 雇主 の 金 を もっ て 逃げ た 。
    EN: The girl made off with her employer 's money .
Output: The little girl had her money for her .

    JA: 試験 中 は なかなか 大変 だっ た 。
    EN: It was tough going during the exams .
Output: I should n't have much grade in the examination .

    JA: 変 な こと を する から 頭 を 診 て もらい なさい よ 。
    EN: You should have your head examined .
Output: Have a hard of mind if you ca n't remember me .

    JA: 今月 の 終わり に 私 の 事務所 に 来 なさい 。
    EN: Come to my office at the end of this month .
Output: You should come to my office on the end of this month .

    JA: あなた 同様 私 は 興奮 など し て い ない 。
    EN: I am no more excited than you are .
Output: I am excited !

    JA: 今日 は 独立 記念 日 です 。
    EN: Today is Independence Day .
Output: It 's no use as a market for the children .

    JA: この 原則 は 子供 に のみ 適用 さ れる 。
    EN: This general rule refers only to children .
Output: This classroom is not at a firm as a child .

    JA: その 事故 は 私 が くる 前 に 起こっ た 。
    EN: The accident happened previous to my arrival .
Output: The accident happened before I arrived .

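Not that many sentences are translated perfectly, but the output is far more properly "English" than I expected, with almost no grammatical mistakes. This is despite the EncoderDecoder algorithm containing no knowledge of grammar at all.

Also, even though all the information is crammed into a vector of only 512 dimensions, the original meaning is recovered in a fair number of cases. Intuitively this model does not look like it should work at all, so it is frightening that it succeeds this often.

It had been a long time since a machine learning result impressed me this much.

Translation results for epochs other than 20 are in a Gist: https://gist.github.com/nojima/186bf4ebf51c6be32d45e6cf9e680c3e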

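Quantitative evaluation

I evaluated the translations with BLEU. I really wanted to evaluate the models from all epochs and watch how the score changes, but that takes far too long, so I only evaluated the results from epochs 4, 8, and 20.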

Epoch	BLEU
4	0.1143
8	0.1256
20	0.1190

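I had a vague feeling this was happening, and indeed the model overfits partway through and the results get worse. Stronger regularization would probably help.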
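This post does not say which BLEU implementation was used. As a rough illustration of what the metric measures, here is a minimal corpus-level BLEU sketch (one reference per sentence, n-grams up to 4, uniform weights, no smoothing). Its scores are not guaranteed to match the table above, and the function names are made up. The sample sentences are taken from the results listed earlier.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Count the n-grams of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(
    references: List[List[str]], hypotheses: List[List[str]], max_n: int = 4
) -> float:
    """Corpus-level BLEU with a single reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    reference_length = hypothesis_length = 0
    for reference, hypothesis in zip(references, hypotheses):
        reference_length += len(reference)
        hypothesis_length += len(hypothesis)
        for n in range(1, max_n + 1):
            hypothesis_counts = ngrams(hypothesis, n)
            # "&" keeps the minimum count per n-gram, i.e. clipped matches.
            clipped = hypothesis_counts & ngrams(reference, n)
            matches[n - 1] += sum(clipped.values())
            totals[n - 1] += sum(hypothesis_counts.values())
    if min(matches) == 0:
        return 0.0  # without smoothing, one empty n-gram order zeroes the score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    if hypothesis_length > reference_length:
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - reference_length / hypothesis_length)
    return brevity_penalty * math.exp(log_precision)


references = [
    "I 'll never let you down .".split(),
    "The accident happened previous to my arrival .".split(),
]
hypotheses = [
    "I 'll never let you down .".split(),
    "The accident happened before I arrived .".split(),
]
print(corpus_bleu(references, hypotheses))
```

The second hypothesis is a fine translation but shares no 4-gram with its reference, which shows how strict BLEU is about exact wording.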

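Coming up next

I have implemented Attention, so I want to write about that. I have also implemented an algorithm that chooses words with beam search at prediction time instead of picking them greedily, so I want to write about that too.

→ Done.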