BERT, which stands for $Bidirectional \ Encoder \ Representations \ from \ Transformers$, is a state-of-the-art model in natural language processing across multiple tasks. In this article, I want to note my take on this model's architecture and its training method.

Beginning

BERT was proposed in 2018, in the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Its model architecture is derived from the Transformer, which was also proposed by Google, in the 2017 paper Attention Is All You Need. Every time I mention these two amazing papers, I can hardly conceal my excitement. They provide creative ideas for helping computers understand natural language. I would even say that every practitioner working in NLP should have a basic knowledge of $BERT$.

To explain $BERT$ clearly, I plan to first introduce the Machine Translation task and its most common model architecture, Sequence to Sequence. Then I'll introduce the Transformer, a kind of Seq2Seq model that incorporates the Attention Mechanism and greatly improves performance. Finally, we'll focus on BERT.

This article is my summary of the related work, and I hope it's helpful. There are also many awesome articles on these topics, which I list in the Reference section at the end. Thanks to their authors for the excellent work of making all the core concepts clear.

Sequence to Sequence

The most common application of $Seq2Seq$ is machine translation. Over the past few years, research in this area has made great strides, and many commercial products such as Google Translate and Youdao Translate now achieve very good translation quality. Simply put, the machine translation task converts text between the languages of different countries or regions, helping people communicate more smoothly.

Beyond machine translation, $Seq2Seq$ is in theory applicable to conversion between any two sequences of text, for example translating between programming languages, say from C++ to Golang. Even for languages that don't exist yet, as long as there is a certain amount of training data, $Seq2Seq$ can discover the mapping between the source language and the target language and help us build that relationship.

First, let's forget the $Seq2Seq$ architecture for a moment and think about how to describe the machine translation scenario itself. Suppose there is an input text $\mathbf{x}=\{x_1, x_2, \ldots, x_n\}$; after translation, we obtain another sequence $\mathbf{y}=\{y_1, y_2, \ldots, y_m\}$. This can be expressed with the probability $p(\mathbf{y} \mid \mathbf{x})$:
$$
\mathbf{y}^{*}=\arg \max_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x})
$$
In theory, the space of $\mathbf{y}$ is unbounded, i.e., there can be many possible translations, but what we need in the end is the output with the highest probability.
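Because the output space is unbounded, this argmax can only be approximated in practice, most commonly by generating the output one token at a time and keeping the locally most probable token at each step (greedy decoding), or by beam search. Below is a minimal greedy-decoding sketch; `next_token_probs(x_tokens, prefix)` is a hypothetical stand-in for a trained model that returns a distribution over the next target token, not an API from any particular library.

```python
def greedy_decode(x_tokens, next_token_probs, eos="</s>", max_len=50):
    """Approximate y* = argmax_y p(y | x) with greedy search.

    `next_token_probs(x_tokens, prefix)` is a hypothetical model
    interface returning a dict that maps each candidate token to
    its probability given the source text and the prefix so far.
    """
    y_tokens = []
    for _ in range(max_len):
        probs = next_token_probs(x_tokens, y_tokens)
        best = max(probs, key=probs.get)  # locally most probable token
        if best == eos:                   # stop at end-of-sequence
            break
        y_tokens.append(best)
    return y_tokens
```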

Next, we rewrite the formula above to include the model parameters $\theta$:
$$
\mathbf{y}^{*}=\arg \max_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x}, \theta)
$$
Now we have defined a probabilistic model for the machine translation scenario. In fact, there are many similar probabilistic models; for example, named entity recognition can be defined in the same way, except that the input $\mathbf{x}$ and the output $\mathbf{y}$ have the same length. The questions to consider next revolve around three aspects (the standard Seq2Seq answers are sketched right after this list):

  1. How to model, i.e., what does the parameter $\theta$ look like?
  2. How to train the parameters?
  3. How to do inference?
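For Seq2Seq specifically, the standard textbook answers to these three questions look roughly as follows; this is a general sketch, not something taken from the BERT or Transformer papers themselves:
$$
\begin{aligned}
&\text{1. Modeling:} && p(\mathbf{y} \mid \mathbf{x}, \theta)=\prod_{t=1}^{m} p\left(y_{t} \mid y_{<t}, \mathbf{x}, \theta\right), \quad \theta \text{ being the encoder/decoder weights} \\
&\text{2. Training:} && \theta^{*}=\arg \max_{\theta} \sum_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}} \log p(\mathbf{y} \mid \mathbf{x}, \theta) \\
&\text{3. Inference:} && \mathbf{y}^{*}=\arg \max_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x}, \theta^{*}) \approx \text{greedy or beam search}
\end{aligned}
$$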

In fact, for most business problems tackled with probabilistic models, whether based on a joint distribution such as HMM, a conditional distribution such as CRF, or the Seq2Seq models introduced later, solving them ultimately comes down to these three steps.
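To make the training step a bit more concrete, here is a minimal sketch of the per-example quantity such a conditional model minimizes: the negative log-likelihood of the reference output. As before, `next_token_probs` is only a hypothetical model interface used for illustration.

```python
import math

def sequence_nll(x_tokens, y_tokens, next_token_probs):
    """Negative log-likelihood of a reference sequence y given x,
    i.e. -log p(y | x, theta) under the token-by-token factorization.
    Training ("step 2") minimizes the sum of this over all (x, y) pairs.
    """
    nll = 0.0
    prefix = []
    for y_t in y_tokens:
        probs = next_token_probs(x_tokens, prefix)  # dict: token -> probability
        nll -= math.log(probs[y_t])                 # penalize low-probability reference tokens
        prefix.append(y_t)
    return nll
```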

Reference