NLP/BERT

GPT-1

A model built by stacking 12 self-attention layers.

It uses masked self-attention when encoding the input: each position attends only to itself and to earlier positions.
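As a rough illustration of the masking above, here is a minimal sketch of how a causal mask shapes attention weights. The function names (`softmax`, `causal_attention_weights`) and the toy all-zero scores are assumptions made for this note and are not taken from the GPT-1 code; a real model would compute the scores from learned query and key projections.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats (entries may be -inf)."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(scores):
    """Mask out future positions in a square score matrix, then softmax each row.

    scores[i][j] is the raw score of query position i against key position j.
    Positions with j > i are set to -inf so they receive zero weight.
    """
    n = len(scores)
    masked = [
        [scores[i][j] if j <= i else -math.inf for j in range(n)]
        for i in range(n)
    ]
    return [softmax(row) for row in masked]

# Toy scores for a 4-token sequence (all zeros, so visible positions share weight equally).
scores = [[0.0] * 4 for _ in range(4)]
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 2) for w in row])
# [1.0, 0.0, 0.0, 0.0]
# [0.5, 0.5, 0.0, 0.0]
# [0.33, 0.33, 0.33, 0.0]
# [0.25, 0.25, 0.25, 0.25]
```

Each row is one query position. The zeros above the diagonal show that no token receives information from tokens that come after it.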

BERT (Pre-training of Deep Bidirectional Transformers for Language Understanding)

A conventional language model uses only one context: either the left side or the right side of a word.

Unlike GPT, BERT uses all of the given words in attention, so each token attends to both its left and its right context.
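The difference between the two attention patterns can be checked with a small experiment. This is a sketch under simplifying assumptions: the hypothetical `attend` function uses the input vectors directly as queries, keys, and values (no learned projections, a single head), and the toy vectors are made up. It changes only the last token and checks whether the output for the first token moves.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats (entries may be -inf)."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attend(vectors, causal):
    """Single-head scaled dot-product self-attention with inputs reused as Q, K, and V.

    causal=True  : GPT-style, position i sees only positions j <= i.
    causal=False : BERT-style, position i sees every position.
    """
    n = len(vectors)
    d = len(vectors[0])
    outputs = []
    for i in range(n):
        scores = []
        for j in range(n):
            if causal and j > i:
                scores.append(-math.inf)  # future position is hidden
            else:
                dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
                scores.append(dot / math.sqrt(d))
        w = softmax(scores)
        outputs.append([sum(w[j] * vectors[j][k] for j in range(n)) for k in range(d)])
    return outputs

original = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
changed = [[1.0, 0.0], [0.0, 1.0], [-2.0, 3.0]]  # only the last token differs

# Does the first token's output stay the same after the last token changes?
same_causal = attend(original, causal=True)[0] == attend(changed, causal=True)[0]
same_bidirectional = attend(original, causal=False)[0] == attend(changed, causal=False)[0]
print(same_causal)         # True: the first token cannot see the last one
print(same_bidirectional)  # False: the first token's output depends on the last one
```

With the causal mask the first token is blind to everything to its right, while with full attention its representation shifts when a later word changes, which is the bidirectional behavior the note describes.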