Trying Japanese-to-English Translation with an EncoderDecoder (Part 3)

This post continues from the previous one.

This time we implement an EncoderDecoder with attention using Chainer.

Model implementation

We start with the model's constructor. As last time, the LSTMs are implemented with NStepLSTM.

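from typing import List

import chainer
import chainer.functions as F
import chainer.links as L
import numpy as np
from chainer import Chain, Variable, optimizers

# EOS (the end-of-sentence word ID) is assumed to be defined elsewhere.


class EncoderDecoder(Chain):
    def __init__(self, input_dimension: int, output_dimension: int, hidden_dimension: int):
        super().__init__()
        with self.init_scope():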
            self._embed_input = L.EmbedID(input_dimension, hidden_dimension)
            self._embed_output = L.EmbedID(output_dimension, hidden_dimension)

            self._encoder = L.NStepLSTM(
                n_layers=1,
                in_size=hidden_dimension,
                out_size=hidden_dimension,
                dropout=0.1)
            self._decoder = L.NStepLSTM(
                n_layers=1,
                in_size=hidden_dimension,
                out_size=hidden_dimension,
                dropout=0.1)

            # W_C in the figure from the previous post
            self._context_layer = L.Linear(2 * hidden_dimension, hidden_dimension)

            # W_D in the figure from the previous post
            self._extract_output = L.Linear(hidden_dimension, output_dimension)

Next, the forward-pass method __call__ is implemented as follows. It is almost the same as last time, with two differences: the third output of self._encoder() is kept under the name attentions, and it is passed to self._calculate_attention_layer_output.

    def __call__(self, xs: List[Variable], ys: List[Variable]) -> Variable:
        batch_size = len(xs)
        # Feed each source sentence to the encoder in reverse order.
        xs = [x[::-1] for x in xs]

        # Decoder inputs start with EOS; decoder targets end with EOS.
        eos = np.array([EOS], dtype=np.int32)
        ys_in = [F.concat((eos, y), axis=0) for y in ys]
        ys_out = [F.concat((y, eos), axis=0) for y in ys]

        embedded_xs = [self._embed_input(x) for x in xs]
        embedded_ys = [self._embed_output(y) for y in ys_in]

        # The third output holds the encoder's hidden state at every time step.
        hidden_states, cell_states, attentions = \
            self._encoder(None, None, embedded_xs)

        _, _, embedded_outputs = \
            self._decoder(hidden_states, cell_states, embedded_ys)

        # Sum the per-sentence losses, then average over the mini-batch.
        loss = 0
        for embedded_output, y, attention in zip(embedded_outputs, ys_out, attentions):
            output = self._calculate_attention_layer_output(embedded_output, attention)
            loss += F.softmax_cross_entropy(output, y)
        loss /= batch_size

        return loss
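
To make the input preparation at the top of __call__ concrete, here is a minimal plain-Python sketch of what happens to a single sentence pair. The word IDs and the value EOS = 0 are made up for illustration (the real ID comes from the vocabulary), and prepare_pair is a hypothetical helper, not part of the model.

```python
# Hypothetical word IDs; EOS = 0 is an assumption made only for this sketch.
EOS = 0


def prepare_pair(x, y):
    """Mirror the list manipulations at the top of __call__ for one sentence pair."""
    x_reversed = x[::-1]   # the source sentence is fed to the encoder back to front
    y_in = [EOS] + y       # decoder input: EOS first, then the reference words
    y_out = y + [EOS]      # decoder target: the same words shifted by one, ending with EOS
    return x_reversed, y_in, y_out


x_reversed, y_in, y_out = prepare_pair([11, 12, 13], [21, 22])
print(x_reversed)  # [13, 12, 11]
print(y_in)        # [0, 21, 22]
print(y_out)       # [21, 22, 0]
```

At position t the decoder receives y_in[t] and is trained to predict y_out[t], so EOS acts as both the start symbol and the stop symbol.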

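_calculate_attention_layer_output is implemented as follows. It simply evaluates the formulas explained last time, but it is a little involved because all $T_y$ outputs are computed in one batch. (In the figure from last time the attention layer only covered the computation up to the context vector, but this function also computes the output of $W_D$, so that code can be shared with the model without attention.)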

    def _calculate_attention_layer_output(
            self, embedded_output: Variable, attention: Variable) -> Variable:
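        # Shapes below: T_y = target length, T_x = source length, H = hidden_dimension.
        # (T_y, H) x (T_x, H)^T -> (T_y, T_x): one score per decoder/encoder position pair
        inner_prod = F.matmul(embedded_output, attention, transb=True)
        # Normalize each row over the encoder positions (F.softmax defaults to axis 1).
        weights = F.softmax(inner_prod)
        # (T_y, T_x) x (T_x, H) -> (T_y, H): one context vector per decoder position
        contexts = F.matmul(weights, attention)
        # (T_y, 2H): context vector placed next to the decoder output
        concatenated = F.concat((contexts, embedded_output))
        new_embedded_output = F.tanh(self._context_layer(concatenated))
        return self._extract_output(new_embedded_output)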
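
The first three lines of this method (scores, softmax, weighted sum) can be traced by hand on tiny vectors. The sketch below is a plain-Python stand-in that uses lists instead of Chainer Variables; the function names and the example vectors are hypothetical and chosen only so that the result is easy to check.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attention_contexts(decoder_outputs, encoder_outputs):
    """Return (weights, contexts) for every decoder step at once.

    decoder_outputs: T_y vectors of size H (embedded_output in the method above)
    encoder_outputs: T_x vectors of size H (attention in the method above)
    """
    # inner_prod[t][s] = <decoder_outputs[t], encoder_outputs[s]>
    inner_prod = [[dot(d, e) for e in encoder_outputs] for d in decoder_outputs]
    # Normalize each row over the encoder positions s.
    weights = [softmax(row) for row in inner_prod]
    # contexts[t] = sum over s of weights[t][s] * encoder_outputs[s]
    hidden = len(encoder_outputs[0])
    contexts = [
        [sum(w * e[k] for w, e in zip(row, encoder_outputs)) for k in range(hidden)]
        for row in weights
    ]
    return weights, contexts


encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # T_x = 3, H = 2
decoder_outputs = [[2.0, 0.0], [0.0, 0.0]]              # T_y = 2, H = 2
weights, contexts = attention_contexts(decoder_outputs, encoder_outputs)
print(weights)
print(contexts)
```

The second decoder vector is all zeros, so its scores are equal and its context is simply the mean of the encoder vectors. The first one puts more weight on the encoder vectors that point in the same direction. In the Chainer method these context vectors are then concatenated with the decoder outputs, passed through W_C and tanh, and projected to vocabulary size by W_D.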

Training implementation

The training code is exactly the same as for the plain EncoderDecoder: a straightforward mini-batch loop.

def train(dataset: DataSet, n_epoch: int = 20):
    model = EncoderDecoder(
        input_dimension=dataset.ja_vocabulary.size,
        output_dimension=dataset.en_vocabulary.size,
        hidden_dimension=512)

    optimizer = optimizers.Adam()
    optimizer.setup(model)

    batch_size = 128

    for epoch in range(n_epoch):
        shuffled = np.random.permutation(dataset.n_sentences)

        for i in range(0, dataset.n_sentences, batch_size):
            logger.info("Epoch {}: Mini-Batch {}".format(epoch, i))

            indices = shuffled[i:i+batch_size]
            xs = [Variable(dataset.ja_sentences[j]) for j in indices]
            ys = [Variable(dataset.en_sentences[j]) for j in indices]

            model.cleargrads()
            loss = model(xs, ys)
            loss.backward()
            optimizer.update()

        yield model, epoch
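
The index handling in train() can be isolated into a few lines. The following is a minimal sketch that uses the standard library's random module as a stand-in for np.random.permutation; minibatch_indices is a hypothetical helper, not a function from the original code.

```python
import random


def minibatch_indices(n_sentences, batch_size, rng):
    """Yield lists of sentence indices covering one epoch in random order."""
    shuffled = list(range(n_sentences))
    rng.shuffle(shuffled)  # stand-in for np.random.permutation in train()
    for i in range(0, n_sentences, batch_size):
        # Slicing past the end is safe, so the final batch may be smaller.
        yield shuffled[i:i + batch_size]


batches = list(minibatch_indices(n_sentences=10, batch_size=4, rng=random.Random(0)))
print([len(b) for b in batches])  # [4, 4, 2]
```

With the 157,154 training sentences and batch_size = 128 used in this post, the same slicing gives 1,228 mini-batches per epoch, the last of which holds 98 sentences. Because __call__ divides the loss by len(xs), the smaller final batch is averaged correctly.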

Translation

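In Part 1 I only presented the model implementation and the training code and forgot to show the code that translates with a trained model, so I present it here.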

The method translate(), which translates the word sequence sentence, can be written as follows. _translate_one_word is explained later.

class EncoderDecoder(Chain):
    ...

    def translate(self, sentence: np.ndarray, max_length: int = 30) -> List[int]:
        with chainer.no_backprop_mode(), chainer.using_config('train', False):
            sentence = sentence[::-1]

            embedded_xs = self._embed_input(sentence)
            hidden_states, cell_states, attentions = \
                self._encoder(None, None, [embedded_xs])

            wid = EOS
            result = []

            for i in range(max_length):
                output, hidden_states, cell_states = \
                    self._translate_one_word(wid, hidden_states, cell_states, attentions)

                wid = int(np.argmax(output.data))  # plain int, matching List[int]
                if wid == EOS:
                    break
                result.append(wid)

            return result

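The encoder performs exactly the same computation as during training. The decoder differs from training in that its input at each step is the most probable word output by the decoder at the previous step.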
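
Stripped of the Chainer details, the loop in translate() is plain greedy decoding. The sketch below shows that loop with a toy step function in place of the decoder; greedy_decode, toy_step, the integer state and EOS = 0 are all hypothetical names and values introduced only for illustration.

```python
from typing import Callable, List, Sequence, Tuple

EOS = 0  # hypothetical end-of-sentence ID, chosen only for this sketch


def greedy_decode(
    step: Callable[[int, int], Tuple[Sequence[float], int]],
    initial_state: int,
    max_length: int = 30,
) -> List[int]:
    """Feed the most probable word back in until EOS or max_length is reached."""
    wid, state, result = EOS, initial_state, []
    for _ in range(max_length):
        probs, state = step(wid, state)
        # Greedy choice: index of the largest probability.
        wid = max(range(len(probs)), key=lambda i: probs[i])
        if wid == EOS:
            break
        result.append(wid)
    return result


def toy_step(wid: int, state: int) -> Tuple[List[float], int]:
    """Stand-in for the decoder: emits words 3, 1, 2 and then EOS."""
    script = [3, 1, 2, EOS]
    probs = [0.1, 0.1, 0.1, 0.1]
    probs[script[min(state, len(script) - 1)]] = 0.7
    return probs, state + 1


print(greedy_decode(toy_step, initial_state=0))  # [3, 1, 2]
```

As in translate(), decoding starts from EOS, the EOS token itself is never appended to the result, and max_length bounds the output when the model never emits EOS.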

Note that chainer.using_config('train', False) is required to disable the dropout in NStepLSTM.

_translate_one_word is implemented as follows:

    def _translate_one_word(self, wid, hidden_states, cell_states, attentions):
        y = np.array([wid], dtype=np.int32)
        embedded_y = self._embed_output(y)
        hidden_states, cell_states, embedded_outputs = \
            self._decoder(hidden_states, cell_states, [embedded_y])

        output = self._calculate_attention_layer_output(embedded_outputs[0], attentions[0])
        output = F.softmax(output)

        return output[0], hidden_states, cell_states

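The logic is almost the same as during training. During training the loss was computed with softmax_cross_entropy, but for translation no loss is computed and the softmax output is returned as is.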
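
Since translate() only takes the argmax of the returned vector, the softmax does not change which word is chosen: softmax preserves the ordering of the scores. The following sketch checks this on made-up scores; softmax and argmax here are small stand-ins for F.softmax and np.argmax.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])


scores = [1.5, -0.3, 4.2, 0.0]  # made-up unnormalized scores, as W_D would produce
probs = softmax(scores)
print(argmax(scores), argmax(probs))  # 2 2
```

The normalized probabilities only start to matter when the scores of whole candidate sentences are compared, which is the case for beam search.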

Experimental results

With a 512-dimensional hidden layer, the model was trained on the training set created in Part 1 (157,154 sentences).

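Feeding 20 sentences from the test set to the models trained for 8 epochs gave the results below. JA and EN are the sentence pair from the test set, ED is the output of the plain EncoderDecoder, and ED+Atn is the output of the EncoderDecoder with attention described in this post. The number after each model name is the BLEU score for that sentence.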

JA    : ----- : 昼食 会 に １ ０ 人 を 招待 し た 。
EN    : ----- : We asked ten people to the luncheon .
ED    : 0.000 : A few people to have lunch up the other .
ED+Atn: 0.000 : We invited ten people at the meeting .

JA    : ----- : 彼女 の 言葉づかい に は 誤り が 多い 。
EN    : ----- : Her grammar is bad .
ED    : 0.000 : She has a lot of errors .
ED+Atn: 0.000 : There are many mistakes in her speech .

JA    : ----- : その 文書 に は その 戦い が １ ７ ０ ０ 年 に 起こっ た と 記録 さ れ て いる 。
EN    : ----- : The document records that the war broke out in 1700 .
ED    : 0.000 : The admission is reported by the total disorder in the year round the year .
ED+Atn: 0.000 : The statue was set in a panic in the world about 10 years ago .

JA    : ----- : はるか その 島 が 見え ます 。
EN    : ----- : You can see the island in the distance .
ED    : 1.000 : You can see the island in the distance .
ED+Atn: 0.000 : The island is able to see the island .

JA    : ----- : 昨日 は 一 日 中 英単語 を 暗記 し た 。
EN    : ----- : I learned English words by heart all day yesterday .
ED    : 0.000 : Yesterday the day was forty yesterday .
ED+Atn: 0.000 : I learned English the English words all day .

JA    : ----- : 第 二 の 問題 を 取り上げ ましょ う 。
EN    : ----- : Let 's take up the second problem , shall we ?
ED    : 0.000 : Let 's discuss the problem together .
ED+Atn: 0.000 : Let 's take two second hand .

JA    : ----- : トム は 空港 へ 向かう 途中 だ 。
EN    : ----- : Tom is on his way to the airport .
ED    : 0.597 : Tom is on the way to the airport .
ED+Atn: 0.525 : Tom is the way to the airport .

JA    : ----- : 私 たち は それ を 実行 不可能 と おもっ た こと が ない 。
EN    : ----- : We never thought of it as impossible to carry out .
ED    : 0.000 : We never have never seen a word .
ED+Atn: 0.000 : We 've got to feel nothing but practice .

JA    : ----- : あと で わかっ た こと だ が 、 彼 は 親切 な 男 だっ た 。
EN    : ----- : He was a kind man , as I later discovered .
ED    : 0.000 : I was lucky , but he was kind of man .
ED+Atn: 0.000 : What was he said , he was a kind of man to see a man .

JA    : ----- : 人々 は より 多く の 自由 と 平等 を 求める 。
EN    : ----- : People pursue more freedom and equality .
ED    : 0.000 : People often compare with them one another .
ED+Atn: 0.000 : People equal most more liberty than more .

JA    : ----- : 私 は 決して あなた 失望 さ せ ませ ん 。
EN    : ----- : I 'll never let you down .
ED    : 1.000 : I 'll never let you down .
ED+Atn: 1.000 : I 'll never let you down .

JA    : ----- : 彼 は とっつき やすい 人 だ 。
EN    : ----- : He is a friendly person .
ED    : 0.000 : He is a man of virtue .
ED+Atn: 1.000 : He is a friendly person .

JA    : ----- : その 少女 は 雇主 の 金 を もっ て 逃げ た 。
EN    : ----- : The girl made off with her employer 's money .
ED    : 0.000 : The little girl saved her money .
ED+Atn: 0.000 : The girl ran into a top of the game .

JA    : ----- : 試験 中 は なかなか 大変 だっ た 。
EN    : ----- : It was tough going during the exams .
ED    : 0.000 : I was very hard during the test .
ED+Atn: 0.000 : The test was very pretty .

JA    : ----- : 変 な こと を する から 頭 を 診 て もらい なさい よ 。
EN    : ----- : You should have your head examined .
ED    : 0.000 : Do n't ignore your head down .
ED+Atn: 0.000 : Ask your hair on explaining things .

JA    : ----- : 今月 の 終わり に 私 の 事務所 に 来 なさい 。
EN    : ----- : Come to my office at the end of this month .
ED    : 0.000 : Come to the end of the end of the month .
ED+Atn: 0.588 : Come to my close to the end of this month .

JA    : ----- : あなた 同様 私 は 興奮 など し て い ない 。
EN    : ----- : I am no more excited than you are .
ED    : 0.000 : I am all excited !
ED+Atn: 0.000 : I am not excited any more excitement .

JA    : ----- : 今日 は 独立 記念 日 です 。
EN    : ----- : Today is Independence Day .
ED    : 0.000 : Today is the day of an expert .
ED+Atn: 0.000 : It is a long day , is a lovely day .

JA    : ----- : この 原則 は 子供 に のみ 適用 さ れる 。
EN    : ----- : This general rule refers only to children .
ED    : 0.000 : This classroom is designed to be a large copy of this .
ED+Atn: 0.000 : This rule applies to children .

JA    : ----- : その 事故 は 私 が くる 前 に 起こっ た 。
EN    : ----- : The accident happened previous to my arrival .
ED    : 0.000 : The accident took place before me .
ED+Atn: 0.000 : The accident happened prior to me .

Some translations are better than those of the plain model, while others are worse. Subjectively, the translations seemed better overall than those of the plain model.

In addition, the tendency to output the same word several times appears stronger than with the plain EncoderDecoder. Input feeding, as described in Luong 2015, might mitigate this.

Quantitative evaluation

The mean BLEU score on the test set was evaluated for both the plain EncoderDecoder and the EncoderDecoder with attention. For each model, the scores at epochs 4, 8, and 20 were computed and compared.

Model	Epoch	BLEU
ED	4	0.1143
ED	8	0.1256
ED	20	0.1190
ED+Atn	4	0.1519
ED+Atn	8	0.1578
ED+Atn	20	0.1508

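Adding attention improves the results considerably: the mean BLEU rises by about 0.03 to 0.04 at each of the three epochs (for example, from 0.1256 to 0.1578 at epoch 8).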

As with the plain EncoderDecoder, BLEU also drops at epoch 20 for the EncoderDecoder with attention. Regularization may be insufficient.
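
This post does not state which BLEU implementation produced the scores. As an assumption, the sketch below implements plain unsmoothed BLEU with n-grams up to length 4 and a brevity penalty, for a single reference; sentence_bleu and ngrams are hypothetical helpers. Under that assumption it reproduces the 0.597 and 0.525 listed for the "Tom" example above.

```python
import math
from collections import Counter


def ngrams(words, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def sentence_bleu(reference, candidate, max_n=4):
    """Unsmoothed BLEU for one reference and one candidate (lists of tokens)."""
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        ref = ngrams(reference, n)
        # Clipped matches: a candidate n-gram counts at most as often as it occurs in the reference.
        matches = sum(min(count, ref[gram]) for gram, count in cand.items())
        if matches == 0:
            return 0.0  # without smoothing, one n-gram order with no match zeroes the score
        log_precision_sum += math.log(matches / sum(cand.values()))
    # Brevity penalty for candidates shorter than the reference.
    if len(candidate) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1.0 - len(reference) / len(candidate))
    return brevity_penalty * math.exp(log_precision_sum / max_n)


reference = "Tom is on his way to the airport .".split()
print(round(sentence_bleu(reference, "Tom is on the way to the airport .".split()), 3))  # 0.597
print(round(sentence_bleu(reference, "Tom is the way to the airport .".split()), 3))     # 0.525
```

If the scores were indeed computed this way, it would also explain why so many sentences score 0.000: a single sentence rarely shares a 4-gram with its reference, and without smoothing that alone forces the score to zero, even for reasonable outputs such as "This rule applies to children ." The mean BLEU in the table is therefore a harsh measure of per-sentence quality.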

Summary

Adding attention improved the accuracy of the EncoderDecoder.

Next time, I will use beam search at translation time, aiming to improve accuracy further.