This is mainly a summary of the papers I read in 2022/05.

The main topics are expected to be Anomaly Detection and Image Generation.

StyleGAN

Link: https://arxiv.org/abs/1812.04948

This work proposes a new generator architecture, shown on the right in the figure below.

[Figure: the style-based generator architecture (traditional generator on the left, StyleGAN on the right)]

Its key contributions:

  • Because $z$ is usually a normalized vector, the factors of variation inside it are hard to disentangle. The initial 8-layer FC mapping network that maps $z$ to $w$ is there to disentangle the features entangled in the latent code.

  • Style transfer: $\operatorname{AdaIN}(\mathbf{x}_{i}, \mathbf{y})=\mathbf{y}_{s, i} \frac{\mathbf{x}_{i}-\mu(\mathbf{x}_{i})}{\sigma(\mathbf{x}_{i})}+\mathbf{y}_{b, i}$. The feature map $x_i$ originally carries its own statistics $\mu(x_i), \sigma(x_i)$; normalizing strips them away, and scaling by $y_{s,i}$ and shifting by $y_{b,i}$ imposes the style $y$ instead (in the original AdaIN, $y_{s,i}=\sigma(y_i)$ and $y_{b,i}=\mu(y_i)$; in StyleGAN they come from a learned affine transform of $w$). A sketch follows after this list.

  • Explains several properties of the generator architecture:

    • Style Mixing: generate two latent codes $z_1, z_2$, map them through the FC network to $w_1, w_2$, and render the corresponding images A and B. Then feed $w_1$ into the upper layers of figure (b) and $w_2$ into the lower layers to obtain a new image C. By varying how many layers count as "upper" vs. "lower", one can probe which features each layer controls.
    • Noise: the observation is that perturbing the noise input at different layers controls different kinds of variation. The paper's hypothesis: "We hypothesize that at any point in the generator, there is pressure to introduce new content as soon as possible, and the easiest way for our network to create stochastic variation is to rely on the noise provided."
  • Proposes two metrics to quantify feature disentanglement:

    • Perceptual path length

      • Motivation: when interpolating from $z_1$ to $z_2$, the image should change as little and as smoothly as possible; the metric measures the perceptual distance between images generated at nearby points along the interpolation path

      • This is to avoid features that are absent from either endpoint appearing in the middle of a linear interpolation path.

    • Linear separability

      • Motivation: in a disentangled latent space, one should be able to find a direction vector corresponding to a single feature
      • We propose another metric that quantifies this effect by measuring how well the latent-space points can be separated into two distinct sets via a linear hyperplane, so that each set corresponds to a specific binary attribute of the image.
      • Uses a linear SVM…
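
A minimal sketch of the AdaIN operation above, in PyTorch. Here `y_s` and `y_b` are assumed to be given per-channel style parameters (in StyleGAN they come from a learned affine transform of $w$):

```python
import torch

def adain(x, y_s, y_b, eps=1e-5):
    """Adaptive Instance Normalization.

    x:   (N, C, H, W) feature maps
    y_s: (N, C) per-channel style scales
    y_b: (N, C) per-channel style biases
    """
    # Normalize away each sample's own per-channel statistics ...
    mu = x.mean(dim=(2, 3), keepdim=True)
    sigma = x.std(dim=(2, 3), keepdim=True, unbiased=False)
    x_norm = (x - mu) / (sigma + eps)
    # ... then impose the target style's scale and bias
    return y_s[:, :, None, None] * x_norm + y_b[:, :, None, None]
```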

Enhancing Photorealism Enhancement

Link: https://arxiv.org/abs/2105.04619

The method in this paper is unpaired image-to-image translation, which learns a mapping from one image collection to another.

[Figure: overview of the method]

This figure is an overview of the paper's method; each component is introduced below.

  • G-buffer
    • [Figure: example G-buffers]
  • Image Enhancement Network
    • [Figure: image enhancement network]
    • Based on HRNetV2 (which performs strongly on dense prediction), with modifications
  • Training Objectives
    • LPIPS Loss: focuses on structural differences (usage sketched after this list)
    • Realism Score
  • A specific sampling strategy during training
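
For reference, LPIPS is available as a pip package (`lpips`); a minimal usage sketch (the VGG backbone choice here is an assumption, not necessarily this paper's configuration):

```python
import torch
import lpips

loss_fn = lpips.LPIPS(net='vgg')       # ImageNet-pretrained VGG backbone

# Inputs are (N, 3, H, W) tensors scaled to [-1, 1]
img0 = torch.rand(1, 3, 256, 256) * 2 - 1
img1 = torch.rand(1, 3, 256, 256) * 2 - 1
d = loss_fn(img0, img1)                # perceptual distance between the images
```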

Methods

G-Buffer Encoder

[Figure: G-buffer encoder architecture]

Each stream contains two residual blocks, with the architecture shown below:

[Figure: residual block architecture]
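
A generic sketch of such a block (the usual conv–norm–ReLU–conv–norm form plus a skip connection; this is my sketch, the authors' exact configuration is in the figure above):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A generic residual block: two 3x3 convs with a skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.body(x)
```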

Perceptual Discriminator

[Figure: perceptual discriminator architecture]

Layout Differences cause Artifacts

Images in different datasets have different layouts. In GTA V, for example, the top of the image is usually sky, while in Cityscapes it is often hills or trees. The discriminator can exploit this cue to decide real vs. fake; when enhancing GTA V images, this tends to push the generator to hallucinate trees at the top of the image.

[Figure: artifacts caused by layout differences between datasets]

  • The fix: sample only matching patches. So what counts as "matching" across different datasets? (a sketch follows this list)
    • Crop size: 7% of the full image per patch
    • Extract a 1 x 1 x 512-dim feature per patch, using a VGG network pretrained on ImageNet, taken at the last ReLU layer
    • Let $p_i, p_j$ come from different datasets; call them "matching" if their cosine similarity > 0.5.
    • Using FAISS can accelerate the nearest-neighbor search.
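
A rough sketch of the matching step, assuming the patches are already cropped (the exact VGG variant and pooling are my guesses, not the released code):

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

# ImageNet-pretrained VGG as a patch feature extractor (VGG-16 assumed here)
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()

@torch.no_grad()
def patch_embedding(patches):
    """patches: (N, 3, H, W) crops, each ~7% of the full image.
    Returns unit-norm 512-dim embeddings, one per patch."""
    feats = vgg(patches)                                # (N, 512, h, w)
    feats = F.adaptive_avg_pool2d(feats, 1).flatten(1)  # (N, 512)
    return F.normalize(feats, dim=1)                    # unit norm for cosine

def matching_pairs(emb_syn, emb_real, thresh=0.5):
    """Indices (i, j) of patch pairs with cosine similarity > thresh."""
    sim = emb_syn @ emb_real.T        # unit vectors: dot product = cosine
    return (sim > thresh).nonzero()   # in practice, FAISS speeds this up
```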

Metrics

IS, FID, and KID are commonly used scores for measuring image realism. The problem with KID is that it only measures realism: even when the structure of the objects depicted in the image changes, KID cannot reflect it. This paper therefore proposes a new metric, sKVD:

  • Extract patches of 1/8 image size from the semantic label maps of the source and target datasets
  • Downsample these patches to 16 x 16 resolution to obtain a 256-dim vector
  • For each such vector from the synthetic dataset, find the nearest neighbor in the set of vectors from the real dataset; retain pairs of vectors with more than 50% matching entries (sketched below)
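
A sketch of this matching step, assuming the label maps are 2-D integer arrays of class ids (the function names are mine):

```python
import numpy as np

def patch_to_vector(label_patch):
    """Nearest-neighbor downsample an integer label patch to 16x16 and
    flatten it into a 256-dim vector of class ids."""
    h, w = label_patch.shape
    ys = np.arange(16) * h // 16
    xs = np.arange(16) * w // 16
    return label_patch[np.ix_(ys, xs)].ravel()   # (256,)

def keep_pair(v_syn, v_real):
    """Retain a (synthetic, real) nearest-neighbor pair only if more than
    50% of the 256 label entries agree."""
    return (v_syn == v_real).mean() > 0.5
```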

Standardized Max Logits

A Simple yet Effective Approach for Identifying Unexpected Road Obstacles in Urban-Scene Segmentation

Key Idea: standardize the max logits to align the different distributions and reflect the relative meanings of max logits within each predicted class.

[Figure: max-logit distributions for each predicted class]

The motivation comes from this figure: for objects never seen in training, the max logit of the Road class is far larger than those of the other classes, so such pixels are likely to be classified as Road.

Even after standardizing the max logits, one remaining problem is that at instance boundaries the pixels tend to have low SML scores; the authors address this with iterative boundary suppression. They then further smooth the anomaly scores with dilated smoothing.

[Figure: overview of the approach]

  1. Standardize the max logits in a class-wise manner (sketched below)
  2. Iterative boundary suppression
    1. Propagate the SMLs of the neighboring non-boundary pixels into the boundary regions
    2. Start from the outer areas of the boundary and move toward the inner areas
    3. "To be specific, we assume the boundary width as a particular value and update the boundaries by iteratively reducing the boundary width at each iteration."
  3. Dilated smoothing
    1. Gaussian kernel
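
A minimal sketch of step 1, the class-wise standardization; the per-class mean/std of the max logit are assumed to be precomputed on the training set, as in the paper:

```python
import torch

def standardized_max_logits(logits, class_mean, class_std):
    """logits:     (N, K, H, W) segmentation logits
    class_mean: (K,) training-set mean of the max logit, per predicted class
    class_std:  (K,) training-set std of the max logit, per predicted class
    Returns the SML map; low values indicate likely anomalies."""
    max_logit, pred = logits.max(dim=1)   # both (N, H, W)
    mu = class_mean[pred]                 # gather the stats of each pixel's
    sigma = class_std[pred]               # predicted class
    return (max_logit - mu) / sigma
```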

Normalizations

Assume the input data has shape $(N, C, H, W)$.

[Figure: comparison of which slices of $(N, C, H, W)$ each normalization method normalizes over]

Batch Normalization

torch.nn.BatchNorm2d(num_features, eps=1e-05, momentum=0.1, affine=True)

Normalizes over $(N, H, W)$, per channel, within a batch. With num_features $=C$, $y=\frac{x-\mathrm{E}[x]}{\sqrt{\operatorname{Var}[x]+\epsilon}} * \gamma+\beta$, where $\mathrm{E}[x]$ and $\operatorname{Var}[x]$ are computed per channel over $(N, H, W)$, and $\gamma, \beta$ are active when affine=True.

The momentum here updates the running statistics (mean and variance). Precisely, $\hat{x}_{\text{new}} = (1 - \text{momentum}) \cdot \hat{x} + \text{momentum} \cdot x_t$, where $\hat{x}$ is the running statistic and $x_t$ is the statistic observed on the current batch.
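
A quick numerical check of the formula (at initialization $\gamma = 1$, $\beta = 0$, so the affine part is the identity):

```python
import torch

x = torch.randn(8, 3, 32, 32)               # (N, C, H, W)
bn = torch.nn.BatchNorm2d(num_features=3)
bn.train()                                  # normalize with batch statistics

# Manual version: statistics per channel, over (N, H, W)
mean = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
y_manual = (x - mean) / torch.sqrt(var + bn.eps)

assert torch.allclose(bn(x), y_manual, atol=1e-5)
```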

InstanceNorm2d

torch.nn.InstanceNorm2d(num_features, eps=1e-05, momentum=0.1, affine=False, track_running_stats=False)

InstanceNorm normalizes over $(H, W)$, i.e., per sample and per channel.

LayerNorm

torch.nn.LayerNorm(normalized_shape, eps=1e-05, elementwise_affine=True)

For an image tensor, passing normalized_shape $=(C, H, W)$ normalizes each sample over all of $(C, H, W)$.

GroupNorm
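
torch.nn.GroupNorm(num_groups, num_channels, eps=1e-05, affine=True)

GroupNorm splits the $C$ channels into num_groups groups and normalizes each group over $(C/G, H, W)$. InstanceNorm and LayerNorm are its two extreme cases, which makes for a quick sanity check:

```python
import torch

x = torch.randn(8, 6, 32, 32)

# One group per channel reproduces InstanceNorm;
# a single group matches LayerNorm over (C, H, W).
gn_instance = torch.nn.GroupNorm(num_groups=6, num_channels=6)
gn_layer = torch.nn.GroupNorm(num_groups=1, num_channels=6)

assert torch.allclose(gn_instance(x), torch.nn.InstanceNorm2d(6)(x), atol=1e-5)
assert torch.allclose(gn_layer(x), torch.nn.LayerNorm([6, 32, 32])(x), atol=1e-5)
```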

Spectral Norm

https://www.youtube.com/watch?v=vG-oEreLG-Q
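
PyTorch exposes this as a wrapper that divides a layer's weight by an estimate of its largest singular value (computed by power iteration); it is commonly used to stabilize GAN discriminators:

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

# At each forward pass, the conv's weight is rescaled so that its
# spectral norm (largest singular value) is approximately 1.
layer = spectral_norm(nn.Conv2d(3, 64, kernel_size=3))
```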

ViT