Notes on Michigan Online's Deep Learning for Computer Vision.

The earlier lectures cover general Machine Learning material, so I skipped them; these are my notes for Lecture 15 through Lecture 22.

Lecture 15: Object Detection



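  • Classification: one label for the whole image
  • Semantic Segmentation: assigns a category to each part of the image (in practice, each pixel)
    • Could this be made self-supervised? It seems plausible?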
  • Object Detection
    • Input: Single RGB Image
    • Output: a set of detected objects; for each object, predict:
      • Category label
      • Bounding box (four numbers: x, y, width, height)
    • Challenges
      • Multiple outputs, and a variable number of them
      • Each output needs a bounding box in addition to a category
      • High-resolution images

Detecting a single object: Multi-task Loss
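
For a single object, the usual setup is one network with two heads, a classification head and a box-regression head, trained on a weighted sum of the two losses. A minimal sketch, assuming softmax cross-entropy for the label and an L2 loss for the box; the function names and `box_weight` are illustrative and not taken from the lecture:

```python
import math


def softmax_cross_entropy(scores, label):
    """Cross-entropy of softmax(scores) against the integer class `label`."""
    m = max(scores)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[label]


def l2_box_loss(pred_box, true_box):
    """Sum of squared differences over the four box numbers (x, y, w, h)."""
    return sum((p - t) ** 2 for p, t in zip(pred_box, true_box))


def multi_task_loss(scores, label, pred_box, true_box, box_weight=1.0):
    """Weighted sum of the classification loss and the box regression loss."""
    cls_loss = softmax_cross_entropy(scores, label)
    box_loss = l2_box_loss(pred_box, true_box)
    return cls_loss + box_weight * box_loss
```

`box_weight` is a hyperparameter that balances the two terms; its value here is an arbitrary placeholder.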


Detecting Multiple Objects: Sliding Window

  • Add an extra category called background, meaning no object was detected in the window


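Problem: enumerating every possible bounding box costs $O(H^2W^2)$, and the same object may be detected repeatedly… Could bisection or a tree structure help?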
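
To see where the $O(H^2W^2)$ comes from, count the boxes directly: a box is a choice of column range times a choice of row range. A small sketch, assuming boxes are axis-aligned and snapped to whole pixels (`count_boxes` is an illustrative name):

```python
def count_boxes(height, width):
    """Number of axis-aligned, pixel-snapped boxes in a height x width image."""
    # A column range [x0, x1] with x0 <= x1 can be chosen in width*(width+1)/2 ways;
    # the same argument applies to row ranges.
    x_ranges = width * (width + 1) // 2
    y_ranges = height * (height + 1) // 2
    return x_ranges * y_ranges


print(count_boxes(600, 800))
```

For a 600 x 800 image this is on the order of $10^{10}$ boxes, far too many to run a CNN on each.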

Region Proposals

Key Idea: find a small set of boxes that are likely to cover all objects

  • Often based on heuristics: e.g. look for “blob-like” image regions
  • Relatively fast to run: Selective Search

R-CNN: Region-Based CNN

  • Run Region Proposals first, and get ~2k region candidates
  • The candidate regions come in different sizes, so warp each one to 224 x 224
  • Run the classification network on each warped region to decide whether it is background or contains an object
  • Also learn a box regression: from the warped 224 x 224 region, predict a transform Delta (x, y, h, w) that corrects the proposal box



  • Evaluating (Comparing Boxes): Intersection over Union (IoU)
    • How can we compare our prediction to the ground-truth box?
    • $\text{IoU} = \dfrac{\text{Area of Intersection}}{\text{Area of Union}}$
    • Also called Jaccard similarity or Jaccard Index
    • IoU > 0.5 is “decent”
    • IoU > 0.7 is “pretty good”
    • IoU > 0.9 is “almost perfect”
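
The IoU formula above can be sketched directly. This assumes boxes are given as (x, y, width, height) with (x, y) the top-left corner, matching the four numbers listed earlier; the function name is illustrative:

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Overlap along each axis; zero if the boxes do not intersect
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0
```

Two 2 x 2 boxes offset by one pixel in each direction share an area of 1 out of a union of 7, so their IoU is 1/7, well below the "decent" threshold.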

Overlapping Boxes

  • Object detectors often output many overlapping detections
  • Trouble: crowded scenes, e.g. images packed with people, where the objects themselves overlap heavily
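
One standard remedy for overlapping detections is greedy non-max suppression (NMS): keep the highest-scoring box, drop every remaining box whose IoU with it exceeds a threshold, and repeat. A minimal sketch, assuming (x, y, width, height) boxes and an illustrative threshold of 0.5; `nms` and `iou` are hypothetical helper names:

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy NMS over a list of (score, box); returns the kept detections."""
    remaining = sorted(detections, key=lambda d: d[0], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)  # highest-scoring detection still alive
        kept.append(best)
        # Discard boxes that overlap the kept box too much
        remaining = [d for d in remaining if iou(best[1], d[1]) <= iou_threshold]
    return kept
```

In crowded scenes this heuristic can also discard correct boxes, because two genuinely different objects may overlap by more than the threshold.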


Evaluating Object Detectors: Mean Average Precision (mAP)


Compute the AP for each category, then take the mean over all categories.

The result is written as, e.g., mAP@0.5 = 0.77, where the IoU threshold (here 0.5) is a parameter.
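
A rough sketch of the computation, assuming each detection has already been marked as a true positive (matched to a ground-truth box above the IoU threshold) or a false positive, and that every category has at least one ground-truth box. It uses the plain, uninterpolated area under the precision-recall curve; benchmarks differ in the interpolation details. All names are illustrative:

```python
def average_precision(detections, num_ground_truth):
    """AP for one category from a list of (score, is_true_positive)."""
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    true_pos = 0
    ap = 0.0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        if is_tp:
            true_pos += 1
            precision = true_pos / rank
            # Each true positive raises recall by 1 / num_ground_truth
            ap += precision / num_ground_truth
    return ap


def mean_average_precision(per_category):
    """per_category maps a category name to (detections, num_ground_truth)."""
    aps = [average_precision(dets, n) for dets, n in per_category.values()]
    return sum(aps) / len(aps)
```

With this definition, AP reaches 1.0 only when every true positive is ranked above every false positive and no ground-truth box is missed.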

Fast R-CNN


RoI Pool: Region of Interest Pooling

  • The cropping must be done in a differentiable way…
  • Region proposals are defined on the original image
  • So project each proposal onto the image features, then snap it to grid cells…


  • Divide the snapped region into sub-regions of roughly equal area, e.g. 2x2
  • Perform max-pooling within each sub-region
  • To avoid snapping, use RoI Align: sample the features with bilinear interpolation at the exact (non-integer) locations instead of rounding to grid cells
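
The RoI Pool steps above (project, snap, divide, max-pool) can be sketched on a single-channel feature map stored as a nested list. This assumes the box is given as corner coordinates (x0, y0, x1, y1) in image pixels and lies inside the image, and that `stride` is the backbone's total downsampling factor; the snapping rule (floor the start, ceil the end) is one possible choice, not necessarily the one used in the lecture:

```python
import math


def roi_pool(features, box, stride, out_size=2):
    """Max-pool the feature cells under `box` into an out_size x out_size grid."""
    rows, cols = len(features), len(features[0])
    # Project the box onto the feature map and snap it to whole cells
    x0 = max(0, math.floor(box[0] / stride))
    y0 = max(0, math.floor(box[1] / stride))
    x1 = min(cols, max(x0 + 1, math.ceil(box[2] / stride)))
    y1 = min(rows, max(y0 + 1, math.ceil(box[3] / stride)))
    region_h, region_w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_size):
        # Row span of bin i; bins are only roughly equal and may share a row
        r_start = y0 + (i * region_h) // out_size
        r_end = max(r_start + 1, y0 + math.ceil((i + 1) * region_h / out_size))
        pooled_row = []
        for j in range(out_size):
            c_start = x0 + (j * region_w) // out_size
            c_end = max(c_start + 1, x0 + math.ceil((j + 1) * region_w / out_size))
            pooled_row.append(max(
                features[r][c]
                for r in range(r_start, r_end)
                for c in range(c_start, c_end)
            ))
        pooled.append(pooled_row)
    return pooled
```

Whatever the proposal size, the output is always out_size x out_size, which is what lets the per-region head use fixed-size layers.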

Faster R-CNN: Learnable Region Proposals

  • Insert a region-proposal network, or RPN, to propose the regions
  • How to use CNN to output region proposals in a trainable way?


  • Imagine anchor boxes of fixed size at each point in the feature map
  • Use a CNN to decide whether each anchor box contains an object…
  • Again, for positive boxes, also predict a box transform to regress from anchor box to object box…
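
The box transform can be sketched with the parameterization commonly used in the R-CNN family: the center is shifted by a fraction of the anchor's size and the width and height are scaled in log space. This assumes boxes are (center x, center y, width, height); the function name is illustrative:

```python
import math


def apply_box_transform(anchor, transform):
    """Turn an anchor box plus a predicted (tx, ty, tw, th) into an object box."""
    px, py, pw, ph = anchor  # center x, center y, width, height
    tx, ty, tw, th = transform
    return (
        px + pw * tx,       # shift the center relative to the anchor width
        py + ph * ty,       # shift the center relative to the anchor height
        pw * math.exp(tw),  # scale the width in log space
        ph * math.exp(th),  # scale the height in log space
    )
```

A transform of all zeros leaves the anchor unchanged, which makes "no correction" the natural starting point for the regressor.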

Two stages


Faster R-CNN can be viewed as two stages:

  • First stage: Run once per image
    • Backbone Network to extract features
    • Region proposal network to suggest regions
  • Second stage: Run once per region
    • Crop features: RoI pool / align
    • Predict object class
    • Predict b-box offset

Single-Stage Object Detection



Mask R-CNN: For Instance Segmentation

Faster R-CNN has two outputs for each candidate object, a class label and a bounding-box offset; to this we add a third branch that outputs the object mask. Mask R-CNN is thus a natural and intuitive idea. But the additional mask output is distinct from the class and box outputs, requiring extraction of much finer spatial layout of an object. Next, we introduce the key elements of Mask R-CNN, including pixel-to-pixel alignment, which is the main missing piece of Fast/Faster R-CNN.

Lecture 17: 3D Vision

In this section:

  • Predicting 3D Shapes from single images
  • Processing 3D Input data

    And we have more interesting topics!


We’ll begin with 3D shape representations:

  • Depth Map
  • Voxel Grid
  • Implicit Surface
  • Point Cloud
  • Mesh

Generating Voxel Shapes: “Voxel Tubes”


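With “voxel tubes”, the 2D CNN outputs a tensor with $V$ channels, each of spatial shape $(V, V)$; the channel dimension is read as the third axis of the $V \times V \times V$ voxel grid.

Of course, the right half of the network can be replaced with a 3D CNN, which adds translation invariance along the Z axis.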
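
To make the voxel-tube reading concrete, here is a small sketch that treats the channel index of the output as the Z coordinate and thresholds each score into an occupied or empty cell. The nested-list layout `output[z][y][x]`, the 0.5 threshold, and the function name are assumptions for illustration:

```python
def tubes_to_voxels(output, threshold=0.5):
    """Convert V channels of V x V occupancy scores into occupied (x, y, z) cells."""
    size = len(output)
    # Every channel must be a size x size plane for the result to be a cube
    assert all(
        len(plane) == size and all(len(row) == size for row in plane)
        for plane in output
    )
    return {
        (x, y, z)
        for z, plane in enumerate(output)  # channel index becomes depth
        for y, row in enumerate(plane)
        for x, score in enumerate(row)
        if score > threshold
    }
```

Each (x, y) position in the 2D output therefore carries a whole "tube" of $V$ occupancy scores along Z.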

  • Voxel Problem: Memory Usage
    • Scaling Voxels: Oct-Trees
    • Nested Shape Layers
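
The memory problem is easy to quantify, since a dense grid grows cubically with resolution. A quick sketch, assuming one 32-bit float per voxel (the function name is illustrative):

```python
def voxel_grid_bytes(resolution, bytes_per_voxel=4):
    """Memory needed to store a dense resolution^3 voxel grid."""
    return resolution ** 3 * bytes_per_voxel


for res in (32, 128, 1024):
    print(res, voxel_grid_bytes(res) / 2 ** 30, "GiB")
```

A single $1024^3$ grid already takes 4 GiB before counting any network activations, which is why sparse or hierarchical structures such as oct-trees are attractive.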

Implicit Functions



  • SDF: Signed Distance Function
  • The function can be learned!
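
As a concrete instance of a signed distance function, a sphere has a closed-form SDF: the distance to the center minus the radius. This sketch assumes the common sign convention of negative inside and positive outside; a learned implicit function would replace the formula with a network queried at the same 3D points:

```python
import math


def sphere_sdf(point, center=(0.0, 0.0, 0.0), radius=1.0):
    """Signed distance from `point` to a sphere: < 0 inside, 0 on the surface, > 0 outside."""
    return math.dist(point, center) - radius


print(sphere_sdf((0.0, 0.0, 0.0)), sphere_sdf((1.0, 0.0, 0.0)), sphere_sdf((2.0, 0.0, 0.0)))
```

The surface itself is the zero level set, i.e. all points where the function returns 0.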