BERT with CRF

16 Dec 2019


These are notes I put together while studying SKT KoBERT-CRF. The repository was a huge help in my studies: kobert-crf

Named Entity Recognition

Named Entity Recognition (NER) is the task of finding named entities in text, and it is usually framed as labeling each token with BIO tags.

In other words, the problem is to find a tag output (y) for each element of an input sequence (x). ¹
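As a small illustration of the x to y framing, the sketch below pairs a made-up token sequence with BIO tags and recovers the entity spans. The sentence, the tag set (PER, LOC), and the helper name `bio_to_spans` are assumptions for illustration and are not taken from KoBERT-CRF.

```python
from typing import List, Tuple

def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (entity_text, entity_type) pairs from a BIO-tagged sequence."""
    spans = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag continues the open entity of the same type.
            current_tokens.append(token)
        else:
            # "O" (or a stray I- tag) closes any open entity.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans

# Hypothetical example: x is the input sequence, y is the tag sequence.
x = ["Barack", "Obama", "visited", "Seoul", "yesterday"]
y = ["B-PER", "I-PER", "O", "B-LOC", "O"]
print(bio_to_spans(x, y))  # [('Barack Obama', 'PER'), ('Seoul', 'LOC')]
```

An NER model only has to predict `y`; turning the tags back into spans is a deterministic post-processing step like the one above.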

BERT for NER

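Our friend BERT handles this kind of problem well, too. The BERT paper gives a hint about how to use the encoder output for a task like NER.

The [CLS] token acts as a pooled summary that condenses information about all the tokens, so rather than using it, we use the last layer's outputs for the single sentence (the per-token representations).

That is, the output of BERT's last layer is an embedding that represents each token of the sequence in context, and the point is to make good use of it.

If you prefer a feature-based approach, you can sum or concatenate the last four layers, as the paper does.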
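The two feature-based options can be sketched with plain arrays. This is only a shape illustration: random NumPy arrays stand in for real encoder outputs, and the sizes (12 layers, hidden size 8) are assumptions chosen to keep the example small.

```python
import numpy as np

# Stand-in for BERT's per-layer outputs: 12 layers, each of shape
# (batch, seq_len, hidden). Random values replace real encoder outputs.
rng = np.random.default_rng(0)
batch, seq_len, hidden, num_layers = 2, 5, 8, 12
all_encoder_layers = [
    rng.normal(size=(batch, seq_len, hidden)) for _ in range(num_layers)
]

last_four = all_encoder_layers[-4:]

# Option 1: element-wise sum keeps the hidden size unchanged.
summed = np.sum(last_four, axis=0)

# Option 2: concatenation along the feature axis gives 4 * hidden features per token.
concatenated = np.concatenate(last_four, axis=-1)

print(summed.shape)        # (2, 5, 8)
print(concatenated.shape)  # (2, 5, 32)
```

Either result is still one vector per token, so it can be fed to a token-level tagger in the same way as the last layer alone.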

Why Conditional Random Field?

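For an explanation of conditional random fields, the post linked here seems to be the best place to go.

So why use a CRF at all instead of just BERT?

This is purely my personal opinion: when we fine-tune, we usually put a simple classifier or regressor on top of the last layer, because once a complex model has already extracted the important features, a simpler classifier or regressor is less prone to over-fitting.

In other words, since the important features are already there, a simple model is enough to make the prediction. (Please correct me if I'm wrong...)

That seems to be why, in the NER task on paperswithcode, even deep-learning approaches often use deep layers at the bottom and apply a CRF at the end.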
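Separately from the simplicity argument above, one commonly cited reason for a CRF output layer is that it scores transitions between neighboring tags, so decoding can avoid sequences such as O followed by I-PER. The sketch below is a hand-written Viterbi decoder over made-up scores, not the implementation used by any CRF library; the tag set and all numbers are assumptions for illustration.

```python
from typing import Dict, List

TAGS = ["O", "B-PER", "I-PER"]

def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[str, Dict[str, float]]) -> List[str]:
    """Return the highest-scoring tag path under emission + transition scores."""
    # best[t][tag] = best score of any path ending in `tag` at step t
    best = [dict(emissions[0])]
    backpointers = []
    for step in emissions[1:]:
        scores, pointers = {}, {}
        for tag in TAGS:
            prev = max(TAGS, key=lambda p: best[-1][p] + transitions[p][tag])
            scores[tag] = best[-1][prev] + transitions[prev][tag] + step[tag]
            pointers[tag] = prev
        best.append(scores)
        backpointers.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(TAGS, key=lambda t: best[-1][t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]

# Made-up per-token scores: the first token slightly prefers O,
# while the second token clearly prefers I-PER.
emissions = [
    {"O": 2.0, "B-PER": 1.5, "I-PER": 0.0},
    {"O": 0.5, "B-PER": 0.0, "I-PER": 2.0},
]
# Made-up transition scores: O -> I-PER is heavily penalized.
transitions = {prev: {cur: 0.0 for cur in TAGS} for prev in TAGS}
transitions["O"]["I-PER"] = -10.0

independent = [max(TAGS, key=lambda t: step[t]) for step in emissions]
print(independent)                      # ['O', 'I-PER'] (invalid BIO sequence)
print(viterbi(emissions, transitions))  # ['B-PER', 'I-PER']
```

Picking the best tag per token independently yields an I-PER with no preceding B-PER, whereas joint decoding with the transition penalty revises the first tag to B-PER.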

So let's look at the actual BERT with CRF structure.

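# Imports assumed from the libraries this snippet appears to rely on
# (pytorch_pretrained_bert, pytorch-crf, and SKT KoBERT).
from torch import nn
from pytorch_pretrained_bert import BertModel, BertConfig
from torchcrf import CRF
from kobert.pytorch_kobert import get_pytorch_kobert_model

# NOTE: `bert_config` (a dict of BERT hyperparameters) is expected to be
# defined at module level; it is not shown in this excerpt.

class KobertCRF(nn.Module):
    """ KoBERT with CRF """
    def __init__(self, config, num_classes, vocab=None) -> None:
        super().__init__()

        if vocab is None:
            # Load the pretrained KoBERT weights together with their vocabulary.
            self.bert, self.vocab = get_pytorch_kobert_model()
        else:
            # Build an untrained BERT from the config dict and use the given vocabulary.
            self.bert = BertModel(config=BertConfig.from_dict(bert_config))
            self.vocab = vocab

        self.dropout = nn.Dropout(config.dropout)
        # Projects each token's hidden state to one score per tag (the CRF emissions).
        self.position_wise_ff = nn.Linear(config.hidden_size, num_classes)
        self.crf = CRF(num_tags=num_classes, batch_first=True)

    def forward(self, input_ids, token_type_ids=None, tags=None):
        # 1.0 for real tokens, 0.0 for padding tokens.
        attention_mask = input_ids.ne(self.vocab.token_to_idx[self.vocab.padding_token]).float()
        all_encoder_layers, pooled_output = self.bert(input_ids=input_ids,
                                                      token_type_ids=token_type_ids,
                                                      attention_mask=attention_mask)
        # Use the token-level outputs of the last layer, not the pooled [CLS] output.
        last_encoder_layer = all_encoder_layers[-1]
        last_encoder_layer = self.dropout(last_encoder_layer)
        emissions = self.position_wise_ff(last_encoder_layer)

        # Note: no mask is passed to the CRF here, so padding positions
        # are scored and decoded as well.
        sequence_of_tags = self.crf.decode(emissions)
        if tags is not None:
            # Training: also return the log-likelihood of the gold tags.
            log_likelihood = self.crf(emissions, tags)
            return log_likelihood, sequence_of_tags
        # Inference: return only the decoded tag sequences.
        return sequence_of_tags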

Naturally, the Linear layer is added to project each token's hidden state down to the number of labels (num_classes).
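The shapes involved can be checked with a minimal NumPy sketch. Random arrays stand in for the encoder output and for the Linear layer's parameters, and the sizes and the padding index `pad_id = 0` are hypothetical values chosen for illustration, not KoBERT's actual ones.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, hidden, num_classes = 2, 4, 8, 5  # illustrative sizes only
pad_id = 0  # hypothetical padding index

# Padding mask: 1.0 for real tokens, 0.0 for padding (same idea as input_ids.ne(pad)).
input_ids = np.array([[11, 12, 13, 0],
                      [21, 22, 0, 0]])
attention_mask = (input_ids != pad_id).astype(np.float32)

# Stand-ins for the last encoder layer and for the Linear layer's parameters.
last_encoder_layer = rng.normal(size=(batch, seq_len, hidden))
weight = rng.normal(size=(hidden, num_classes))
bias = np.zeros(num_classes)

# The same weight is applied at every position, giving one score per tag per token.
emissions = last_encoder_layer @ weight + bias

print(attention_mask)
print(emissions.shape)  # (2, 4, 5)
```

The last dimension of `emissions` has to equal the number of tags the CRF was built with, which is why the projection is needed at all.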

In the end, this makes good use of the architecture figure from the BERT paper.

References:

¹ https://lovit.github.io/nlp/2018/06/22/crf_based_ner/
