
NN

Basics

The classic Bayes formula: $P(Y \mid X) = \frac{P(X \mid Y)\,P(Y)}{P(X)}$

  • $Y$: hypothesis / latent variable
  • $X$: observed data
  • $P(Y)$: prior
  • $P(X \mid Y)$: likelihood
  • $P(Y \mid X)$: posterior
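As a numeric sanity check, a minimal sketch of the formula; the sensitivity, false-positive rate, and prevalence below are made-up illustrative numbers:

```python
# Bayes' rule on made-up numbers: a test with 99% sensitivity,
# a 5% false-positive rate, and 1% prevalence of the condition.
p_y = 0.01              # prior P(Y)
p_x_given_y = 0.99      # likelihood P(X | Y)
p_x_given_not_y = 0.05  # P(X | not Y)

# Evidence P(X) by the law of total probability
p_x = p_x_given_y * p_y + p_x_given_not_y * (1 - p_y)

# Posterior P(Y | X)
p_y_given_x = p_x_given_y * p_y / p_x
print(round(p_y_given_x, 3))  # 0.167
```

Despite the high sensitivity, the low prior keeps the posterior well under 20%.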

CNN

  • Problem with MLPs: a fully connected layer over raw pixels needs far too many parameters
  • Data property: images exhibit locality
  • Translation invariance

RNN

Sequence-modeling problems

$h_t = \phi(w_2 h_{t-1} + w_1 x_t + b)$

Choice of $\phi$:

  • non-negative: ReLU
  • 0 to 1: sigmoid (softmax for multiple outputs)
  • -1 to 1: tanh
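The recurrence above can be sketched directly; the dimensions and random weights below are illustrative only:

```python
import numpy as np

# Minimal sketch of one RNN step h_t = tanh(w2 h_{t-1} + w1 x_t + b).
rng = np.random.default_rng(0)
d_h, d_x = 4, 3
w2 = rng.normal(0, 0.1, (d_h, d_h))
w1 = rng.normal(0, 0.1, (d_h, d_x))
b = np.zeros(d_h)

def rnn_step(h_prev, x_t):
    return np.tanh(w2 @ h_prev + w1 @ x_t + b)

h = np.zeros(d_h)
for _ in range(5):  # unroll over a short sequence
    h = rnn_step(h, rng.normal(size=d_x))
print(h.shape)  # (4,)
```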

Problems:

  • $w_2$ is multiplied into the gradient at every step, so the gradient either explodes or vanishes -> gradient clipping: $g \leftarrow \min(1, \frac{\theta}{\lVert g \rVert})\,g$, where $\theta$ is the threshold
  • Important timesteps receive no special treatment -> modulate $h$ according to how important $x$ is (LSTM)
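The clipping rule above can be sketched in a few lines; the gradient values are illustrative:

```python
import numpy as np

# Sketch of gradient clipping by norm: g <- min(1, theta / ||g||) * g.
def clip_by_norm(g, theta):
    norm = np.linalg.norm(g)
    if norm == 0:
        return g
    return min(1.0, theta / norm) * g

g = np.array([3.0, 4.0])              # ||g|| = 5
clipped = clip_by_norm(g, theta=1.0)  # rescaled to norm 1
small = clip_by_norm(np.array([0.3, 0.4]), theta=1.0)  # left unchanged
print(clipped, small)
```

Gradients below the threshold pass through untouched; only large gradients are rescaled.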

LSTM

Long Short-Term Memory

$h_t = \phi(w_2 h_{t-1} + w_1 x_t + b)$ is augmented with three gates: forget, input, and output.

GRU

Gated Recurrent Unit

Forget part of the old memory: $h'_{t-1} = f_t h_{t-1}$. New candidate memory: $\tilde{C}_t = \tanh(w_2 h'_{t-1} + w_1 x_t + b)$. Final state: $h_t = r_t h_{t-1} + (1 - r_t)\tilde{C}_t$
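One step of these equations can be sketched as follows; the weight matrices and gate parameterization (sigmoid gates over the concatenated state and input) are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Sketch of one GRU step following the equations above
# (f_t as the reset-style gate, r_t as the blend/update-style gate).
rng = np.random.default_rng(1)
d = 4
Wf = rng.normal(0, 0.1, (d, 2 * d))
Wr = rng.normal(0, 0.1, (d, 2 * d))
W2 = rng.normal(0, 0.1, (d, 2 * d))

def gru_step(h_prev, x_t):
    hx = np.concatenate([h_prev, x_t])
    f_t = sigmoid(Wf @ hx)   # forgets part of the old memory
    r_t = sigmoid(Wr @ hx)   # blends old memory with the new candidate
    c_tilde = np.tanh(W2 @ np.concatenate([f_t * h_prev, x_t]))
    return r_t * h_prev + (1 - r_t) * c_tilde

h = gru_step(np.zeros(d), rng.normal(size=d))
print(h.shape)  # (4,)
```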

Attention

seq2seq suffers from an information bottleneck: the entire source sequence is compressed into a single vector.

QKV attention: the query comes from the decoder hidden state, $Q = w_1\,dh_t$; the keys come from the encoder hidden states, $K = w_2\,eh_i$; similarity $\alpha_t = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)$; context $C_t = \alpha_t V$
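A minimal sketch of the scaled dot-product above, with one decoder query attending over five encoder key/value vectors (shapes and random values are illustrative):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Scaled dot-product attention: alpha = softmax(Q K^T / sqrt(d_k)), C = alpha V.
d_k = 8
rng = np.random.default_rng(2)
Q = rng.normal(size=(1, d_k))   # one decoder query
K = rng.normal(size=(5, d_k))   # five encoder keys
V = rng.normal(size=(5, d_k))   # five encoder values

alpha = softmax(Q @ K.T / np.sqrt(d_k))  # attention weights, shape (1, 5)
C = alpha @ V                            # context vector, shape (1, 8)
print(C.shape)
```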

ResNet

Deep models are hard to train because of vanishing/exploding gradients -> normalization. Even so, beyond a certain depth accuracy degrades, and this is not overfitting -> ResNet.

$H(x) = F(x) + x$, with the addition applied element-wise

Shortcut connections predate ResNet: highway networks (gating borrowed from LSTM) use $H(x) = F(x)g(x) + x(1 - g(x))$, but perform worse than ResNet.

What if the shortcut dimensions do not match? A. zero padding; B. projection only where dimensions differ; C. projection on all shortcuts. Option B is chosen.

The deeper bottleneck architecture goes deeper without excessive overfitting.

$h(x) = f(x) + x$, $y = g(h(x)) + h(x)$

$\frac{\partial h(x)}{\partial x} = \frac{\partial f(x)}{\partial x} + 1$

$\frac{\partial y}{\partial x} = \frac{\partial (g(h) + h)}{\partial h}\cdot\frac{\partial (f(x) + x)}{\partial x} = (g'(h) + 1)(f'(x) + 1) = g'(h)f'(x) + g'(h) + f'(x) + 1$

The $+1$ terms are what preserve the gradient; highway networks do worse because their gate scales the shortcut, so the gradient decays like $\lambda^n$.
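The gradient identity for two stacked residual blocks can be checked numerically; the scalar functions f and g below are arbitrary smooth stand-ins:

```python
import numpy as np

# Check dy/dx = (g'(h) + 1)(f'(x) + 1) for h = f(x) + x, y = g(h) + h,
# with illustrative choices f(x) = sin(x) and g(h) = h^2 / 2.
f, df = np.sin, np.cos
g = lambda h: 0.5 * h ** 2
dg = lambda h: h

def y(x):
    h = f(x) + x
    return g(h) + h

x0, eps = 0.7, 1e-6
numeric = (y(x0 + eps) - y(x0 - eps)) / (2 * eps)  # central difference
h0 = f(x0) + x0
analytic = (dg(h0) + 1) * (df(x0) + 1)
print(abs(numeric - analytic) < 1e-6)  # True
```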

Transformer

seq2seq SOTA progression: RNN -> LSTM -> GRU. Problem: their inherently sequential nature prevents parallelization.

Convolutional seq2seq:

Self-attention for feature extraction: adaptive, sees the full context, and is simple to compute.

Multi-head attention (MHA): analogous to multi-channel convolution with several kernels, each head provides a different view.

Positional Encoding: $PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d}}\right)$, $PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d}}\right)$

No learned parameters are needed, the values stay in a stable numeric range, and the encoding captures relative as well as absolute position.

Can a position added element-wise really be recovered by the model? Visualizations (e.g. BertViz) show that some heads do attend to position.
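The encoding table can be built in a few lines; sin on even dimensions, cos on odd ones:

```python
import numpy as np

# Sketch of the sinusoidal positional encoding with the frequencies
# pos / 10000^(2i/d); no learned parameters are involved.
def positional_encoding(n_pos, d):
    pos = np.arange(n_pos)[:, None]          # (n_pos, 1)
    i = np.arange(0, d, 2)[None, :]          # even dimension indices
    angles = pos / np.power(10000.0, i / d)  # (n_pos, d/2)
    pe = np.zeros((n_pos, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(50, 16)
print(pe.shape, pe[0, 0], pe[0, 1])  # (50, 16) 0.0 1.0
```

Every entry stays in [-1, 1], so the encoding can be added to embeddings without dominating them.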

Decoder masked attention: future positions are set to $-\infty$, so their weight becomes 0 after softmax.
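A minimal sketch of that causal mask on a 4-token score matrix (random scores are illustrative):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Positions above the diagonal (the future) are set to -inf,
# so their softmax weight is exactly 0.
n = 4
scores = np.random.default_rng(3).normal(size=(n, n))
scores[np.triu_indices(n, k=1)] = -np.inf
weights = softmax(scores)
print(weights[0])  # the first token can only attend to itself
```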

| Layer Type | Complexity per Layer | Sequential Operations | Maximum Path Length |
| --- | --- | --- | --- |
| Self-Attention | O(n² · d) | O(1) | O(1) |
| Recurrent | O(n · d²) | O(n) | O(n) |
| Convolutional | O(k · n · d²) | O(1) | O(logₖ(n)) |
| Self-Attention (restricted) | O(r · n · d) | O(1) | O(n / r) |

Text to Image

Conditional generation. Classifier-based guidance: a separate classifier must be trained first, which is cumbersome. Fixed guidance: the prompt is injected during training; it follows instructions but has poor diversity. Classifier-free guidance: the guidance strength can be tuned.

Classifier-based Guidance

Taking diffusion as an example: $P(x_{t-1} \mid x_t)$ -> guided: $P(x_{t-1} \mid x_t, y)$

$$
\begin{aligned}
P(x_{t-1}\mid x_t, y) &= \frac{P(x_{t-1}\mid x_t)\,P(y\mid x_{t-1}, x_t)}{P(y\mid x_t)} \\
&= \frac{P(x_{t-1}\mid x_t)\,P(y\mid x_{t-1})}{P(y\mid x_t)} \\
&= P(x_{t-1}\mid x_t)\, e^{\log P(y\mid x_{t-1}) - \log P(y\mid x_t)} \\
&\approx P(x_{t-1}\mid x_t)\, e^{(x_{t-1} - x_t)\cdot\nabla_{x_t}\log P(y\mid x_t)}
\end{aligned}
$$

$$
P(x_{t-1}\mid x_t) = \mathcal{N}(x_{t-1};\, \mu(x_t),\, \sigma_t^2 I) \propto e^{-\lVert x_{t-1} - \mu(x_t) \rVert^2 / 2\sigma_t^2}
$$

$$
\Rightarrow P(x_{t-1}\mid x_t, y) \propto e^{\frac{-\lVert x_{t-1} - \mu(x_t) \rVert^2}{2\sigma_t^2} + (x_{t-1} - x_t)\cdot\nabla_{x_t}\log P(y\mid x_t)} \propto e^{-\lVert x_{t-1} - \mu(x_t) - \sigma_t^2\,\nabla_{x_t}\log P(y\mid x_t) \rVert^2 / 2\sigma_t^2}
$$

$$
\Rightarrow x_{t-1} = \mu(x_t) + \sigma_t^2\,\nabla_{x_t}\log P(y\mid x_t) + \sigma_t\,\varepsilon
$$
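One guided reverse step from the final line can be sketched numerically; the reverse mean, noise scale, and toy classifier below are all illustrative stand-ins, not a trained model:

```python
import numpy as np

# Sketch of one classifier-guided reverse step:
# x_{t-1} = mu(x_t) + sigma_t^2 * grad_x log P(y | x_t) + sigma_t * eps.
rng = np.random.default_rng(4)
x_t = rng.normal(size=3)
mu = 0.9 * x_t            # stand-in for the learned reverse mean mu(x_t)
sigma_t = 0.1
y_target = np.array([1.0, 0.0, -1.0])

def grad_log_p_y_given_x(x):
    # Toy classifier: log P(y | x) = -||x - y_target||^2 / 2 (up to a constant)
    return -(x - y_target)

eps = rng.normal(size=3)
x_prev = mu + sigma_t ** 2 * grad_log_p_y_given_x(x_t) + sigma_t * eps
print(x_prev.shape)  # (3,)
```

The guidance term nudges the mean toward regions where the classifier assigns higher probability to the condition $y$.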

Fixed Guidance

Taking flow matching as an example: $\mathcal{L}_{CFM}^{guided}(\theta; y) = \mathbb{E}_{(x)}\left\lVert \mu_t^\theta(x \mid y) - \mu_t(x \mid x_1) \right\rVert^2$

Classifier-free Guidance (SOTA)
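The core of classifier-free guidance is a blend of the model's conditional and unconditional predictions, with a scale $w$ that controls guidance strength; the vectors below are illustrative stand-ins for network outputs:

```python
import numpy as np

# Classifier-free guidance blend: eps = eps_uncond + w * (eps_cond - eps_uncond).
# w = 0 is unconditional, w = 1 is conditional, w > 1 strengthens guidance
# at the cost of diversity.
def cfg_combine(eps_uncond, eps_cond, w):
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_u = np.array([0.0, 0.0])   # stand-in for the unconditional prediction
eps_c = np.array([1.0, -1.0])  # stand-in for the conditional prediction
print(cfg_combine(eps_u, eps_c, 1.0))  # the conditional prediction
print(cfg_combine(eps_u, eps_c, 2.0))  # extrapolated past the condition
```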

Hierarchical/Cascaded Diffusion https://arxiv.org/abs/2106.15282

LDM https://arxiv.org/abs/2112.10752

Stable Diffusion uses a U-Net backbone.