AlexNet

pytorch

  
class AlexNet(nn.Module):
    def __init__(self, n_classes : int) -> None:
        super(AlexNet, self).__init__()
        self.n_classes = n_classes
        self.layer = nn.Sequential(
                              # conv1
                              nn.Conv2d(3, 64, 11, 4, 2)
                              nn.ReLU(inplace=True)
                              nn.MaxPool2d(3, 2)
                              # conv2
                              nn.Conv2d(64,192,5,padding=2)
                              nn.ReLU(inplace=True)
                              nn.MaxPool2d(3,2)
                              # conv3
                              nn.Conv2d(192,384,3,padding=1)
                              nn.ReLU(inplace=True)
                              # conv4
                              nn.Conv2d(384,256,3,padding=1)
                              nn.ReLU(inplace=True)
                              # conv5
                              nn.Conv2d(256,256,3,padding=1)
                              nn.ReLU(inplace=True) 
                              nn.MaxPool2d(3,2)
                              )

        self.avgpool = nn.AdaptiveavgPool2d((6,6))
        self.classifier = nn.Sequential(
                              # dropout
                              nn.Dropout(),
                              nn.Linear(256*6*6, 4096),
                              nn.ReLU(inplace=True),
                              nn.Dropout(),
                              nn.Linear(4096,4096),
                              nn.ReLU(inplace=True),
                              nn.Linear(4096, n_classes),
                              )
    
    def forward(self, x : torch.Tensor) -> torch.Tensor:
      x = self.layer(x)
      x = self.avgpool(x)
      # output shape : (batch size * 256, 6, 6)
      x = torch.flatten(x,1)
      # output shape : (batch_size, 256 * 6 * 6)
      x = self.classifier(x)
      return x

  
model1 = AlexNet(10)

import torchsummary
torchsummary.summary(model1, (3,256,256))

# --------------------- #

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1           [-1, 64, 63, 63]          23,296
              ReLU-2           [-1, 64, 63, 63]               0
         MaxPool2d-3           [-1, 64, 31, 31]               0
            Conv2d-4          [-1, 192, 31, 31]         307,392
              ReLU-5          [-1, 192, 31, 31]               0
         MaxPool2d-6          [-1, 192, 15, 15]               0
            Conv2d-7          [-1, 384, 15, 15]         663,936
              ReLU-8          [-1, 384, 15, 15]               0
            Conv2d-9          [-1, 256, 15, 15]         884,992
             ReLU-10          [-1, 256, 15, 15]               0
           Conv2d-11          [-1, 256, 15, 15]         590,080
             ReLU-12          [-1, 256, 15, 15]               0
        MaxPool2d-13            [-1, 256, 7, 7]               0
AdaptiveAvgPool2d-14            [-1, 256, 6, 6]               0
          Dropout-15                 [-1, 9216]               0
           Linear-16                 [-1, 4096]      37,752,832
             ReLU-17                 [-1, 4096]               0
          Dropout-18                 [-1, 4096]               0
           Linear-19                 [-1, 4096]      16,781,312
             ReLU-20                 [-1, 4096]               0
           Linear-21                   [-1, 10]          40,970
================================================================
Total params: 57,044,810
Trainable params: 57,044,810
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.75
Forward/backward pass size (MB): 10.96
Params size (MB): 217.61
Estimated Total Size (MB): 229.32
----------------------------------------------------------------

파라미터의 크기를 보면, linear layer에서 가장 크다. 그럼에도 불구하고, linear(4096, 4096)를 쓰는 이유는??

convolution 연산의 출력의 크기

out_w = (in_w - k_w + 2 * pad) / stride + 1
out_h = (in_h - k_h + 2 * pad) / stride + 1
out_channels = num_kernels

매개 변수의 수

커널마다 (kernel size x kernel size x in_channels)개의 가중치와 1개의 bias를 가진다. 따라서 전체 매개변수의 수는 (kernel size x kernel size x in_channels) x num_kernels + 1 x num_kernels

tensorflow

  
def AlexNet(input_shape=None, weight=None, classes=1000, classifier_activation='softmax') :
  model = tf.keras.Sequential([
    # conv1
    tf.keras.layers.Conv2D(filters=96,
                          kernel_size=(11,11),
                          stride=4,
                          padding="valid", # no zero padding
                          activation=tf.keras.activations.relu,
                          input_shape=input_shape),
    tf.keras.layers.MaxPool2D(pool_size=(3,3),
                            stride=2,
                            padding="valid"),
    tf.keras.layers.BatchNormalization(),
    # conv2
    tf.keras.layers.Conv2D(filters=256,
                          kernel_size=(5,5),
                          strides=1,
                          padding="same", # zero padding
                          activation=tf.keras.activations.relu),
    tf.keras.layers.MaxPool2D(pool_size=(3,3),
                            stride=2,
                            padding="valid"),
    tf.keras.layers.BatchNormalization(),
    # conv3
    tf.keras.layers.Conv2D(filters=384,
                          kernel_size=(3,3),
                          strides=1,
                          padding="same",
                          activation=tf.keras.activations.relu),
    # conv4
    tf.keras.layers.Conv2D(filters=384,
                          kernel_size=(3,3),
                          strides=1,
                          padding="same",
                          activation=tf.keras.activations.relu),
    # conv5
    tf.keras.layers.Conv2D(filters=256,
                          kernel_size=(3,3),
                          strides=1,
                          padding="same",
                          activation=tf.keras.activations.relu),
    tf.keras.layers.MaxPool2D(pool_size=(3,3),
                          stride=2,
                          padding="same"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Flatten(),

    # classifier
    tf.keras.layers.Dense(units=4096, # linear
                          activation=tf.keras.activations.relu),
    tf.keras.layers.Dropout(rate=0.2),
    tf.keras.layers.Dense(units=4096,
                          activation=tf.keras.activations.relu),
    tf.keras.layers.Dropout(rate=0.2),
    tf.keras.layers.Dense(units=classes,
                          activation=tf.keras.activations.softmax)
  ])

  return model

VGGNet16

pytorch

https://github.com/pytorch/vision/blob/main/torchvision/models/vgg.py

  
import torch
import torch.nn as nn

class VGG(nn.Module):
    def __init__(self, features : nn.Module, num_classes: int = 1000, init_weights : bool = True) -> None:
        super(VGG,self).__init__()
        self.features = features # make layers
        self.avgpool = nn.AdaptiveAvgPool2d((7,7)) # output shape : [in_channels, 7, 7]
        self.classifier = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096),
            nn.ReLU(True),
            nn.Dropout(),
            nn.Linear(4096, 4096),
            nn.ReLU(True),
            nn.Dropout(),
            nn.Linear(4096, num_classes)
        )

        if init_weights:
            self._initialize_weights()
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = self.avgpool(x)

        x = torch.flatten(x,1)
        x = self.classifier(x)
        return x


def vgg16(pretrained : bool = False, progress : bool = True, **kwargs) -> VGG:
  ''' VGG 16 layer model
  very deep convolutional networks for large scale image recognition
  args:
    pretrained (bool) : if True, returns a model pretrained on imageNet
    progress (bool) : if True, displays a progress bar of the download to stderr
  '''
  return _vgg('vgg16', 'D', False, pretrained, progress, **kwargs)

def vgg16_bn(pretrained : bool = False, progress : bool = True, **kwargs) -> VGG:
  return _vgg('vgg16', 'D', True, pretrained, progress, **kwargs)


def make_layers(cfg : dict, batch_norm : bool = False) -> nn.Sequential:
  layer : list[nn.Module] = []
  in_channels = 3
  for v in cfg:
    # max pooling
    if v == "M":
      layers += [nn.MaxPool2d(kernel_size=2, stride=2)]

    else:
      v = cast(int, v)
      conv2d = nn.Conv2d(in_channels, v, kernel_size=3, padding=1)
      # conv
      # batch_norm
      # activation
      if batch_norm:
        layers += [conv2d, nn.BatchNorm2d(v), nn.ReLU(inplace=True)]
      # conv
      # activation
      else:
        layers += [conv2d, nn.ReLU(inplace=True)]

      # 다음 conv input channel 
      in_channels = v
  return nn.Sequential(*layers)

cfgs : dict = {
  'A' : [64, 'M', 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
  'B' : [64, 64, 'M', 128, 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
  'D' : [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512, 'M', 512, 512, 512, 'M'],
  'E' : [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M', 512, 512, 512, 512, 'M', 512, 512, 512, 512, 'M']
}

def _vgg(arch : str, cfg : str, batch_norm : bool, pretrained : bool, progress : bool, **kwargs : Any) -> VGG:
  if pretrained: kwargs['init_weights'] = False

  model = VGG(make_layers(cfgs[cfg], batch_norm=batch_norm), **kwargs)
  ...

tensorflow

  
def VGG16(input_shape=None, weights='imagenet', input_tensor=None, pooling=None, classes=1000, classifier_activation='softmax') :
  ''' 
  args:
    input_shape : classifier를 포함할지 말지에 대한 인자
  '''
  layer = tf.keras.Sequential([
    # Block1
    tf.keras.layers.Conv2D(filters=64,
                          kernel_size=(3,3),
                          padding="same",
                          activation='relu',
                          input_shape=input_shape
                          name='block1_conv1'),
    tf.keras.layers.Conv2D(filters=64,
                          kernel_size=(3,3),
                          padding="same",
                          activation='relu',
                          name='block1_conv2'),
    tf.keras.layers.MaxPool2D(pool_size=(2,2),
                            stride=2,
                            name='block1_pool'),

    # Block2
    tf.keras.layers.Conv2D(filters=128,
                          kernel_size=(3,3),
                          padding="same",
                          activation='relu',
                          name='block2_conv1'),
    tf.keras.layers.Conv2D(filters=128,
                          kernel_size=(3,3),
                          padding="same", 
                          activation='relu',
                          name='block2_conv2'),
    tf.keras.layers.MaxPool2D(pool_size=(2,2),
                            stride=2,
                            name='block2_pool'),

    # Block3
    tf.keras.layers.Conv2D(filters=256,
                          kernel_size=(3,3),
                          padding="same",
                          activation='relu',
                          name='block3_conv1'),
    tf.keras.layers.Conv2D(filters=256,
                          kernel_size=(3,3),
                          padding="same",
                          activation='relu',
                          name='block3_conv2'),
    tf.keras.layers.Conv2D(filters=256,
                          kernel_size=(3,3),
                          padding="same",
                          activation='relu',
                          name='block3_conv3'),
    tf.keras.layers.MaxPool2D(pool_size=(2,2),
                            stride=2,
                            name='block3_pool'),

    # Block4
    tf.keras.layers.Conv2D(filters=512,
                          kernel_size=(3,3),
                          padding="same",
                          activation='relu',
                          name='block4_conv1'),
    tf.keras.layers.Conv2D(filters=512,
                          kernel_size=(3,3),
                          padding="same",
                          activation='relu',
                          name='block4_conv2'),
    tf.keras.layers.Conv2D(filters=512,
                          kernel_size=(3,3),
                          padding="same",
                          activation='relu',
                          name='block4_conv3'),
    tf.keras.layers.MaxPool2D(pool_size=(2,2),
                            stride=2,
                            name='block4_pool'),

    # Block5
    tf.keras.layers.Conv2D(filters=512,
                          kernel_size=(3,3),
                          padding="same",
                          activation='relu',
                          name='block5_conv1'),
    tf.keras.layers.Conv2D(filters=512,
                          kernel_size=(3,3),
                          padding="same",
                          activation='relu',
                          name='block5_conv2'),
    tf.keras.layers.Conv2D(filters=512,
                          kernel_size=(3,3),
                          padding="same",
                          activation='relu',
                          name='block5_conv3'),
    tf.keras.layers.MaxPool2D(pool_size=(2,2),
                            stride=2,
                            name='block5_pool'),

    if include_top:

      tf.keras.layers.Flatten(name='flatten'),

      # classifier
      tf.keras.layers.Dense(units=4096, # linear
                          activation='relu',
                          name='fc1'),
      tf.keras.layers.Dense(units=4096, # linear
                          activation='relu',
                          name='fc2'),
      tf.keras.layers.Dense(units=classes, # linear
                          activation=classifier_activation,
                          name='prediction'),
  ])

  if weights == 'imagenet':
    if include_top:
      weight_path = data_utils.get_file(
        'vgg16_weights_tf_dim_ordering_tf_kernels.h5',
        WEIGHTS_PATH,
        cache_subdir='models',
        file_hash = '')
    else:
      weight_path = data_utils.get_file(
        'vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5',
        WEIGHTS_PATH_NO_TOP,
        cache_subdir='models',
        file_hash = '')

    model.load_weights(weights_path)
  else:
    model.load_weights(weights)
  
  return model

  
model = tf.keras.Sequential()
for l in layer:
  model.add(l)

model.add(keras.layers.Dense(num_classes, activation='softmax', name="prediction"))

GoogLeNet

1x1 커널

이 때, 사용했던 1x1 커널을 설명하고 넘어가자.

1x1 커널이란 차원의 통합 및 축소를 할 수 있는 효과를 가지고 있다.

예를 들어 input이 [in_b, in_c, in_h, in_w] 를 가지고 있고, filter를 [1, in_c, 1, 1] 을 사용하게 되면 output의 shape은 [in_b, 1, (in_h - 1 + 2p) // stride + 1, (in_h - 1 + 2*pad) // stride + 1] 을 가진다. 만약 pad=0, stride=1이라면 (in_h - 1) + 1 = in_h, in_w가 된다. 따라서 1x1 커널을 사용하게 되면 크기는 그대로지만, 차원을 축소시킬 수 있는 효과를 가지고 있다.

  
def __init__(self):
  self.fc0 = nn.Linear(1024 * 8 * 8, self.n_classes, bias = False)

def forward(self, x):
  ...
  x = x.view(-1,1024 * 8 * 8)
  x = self.fc0(x)

# --------------- #
Linear-33                   [-1, 10]         655,360

  
def __init__(self):
  self.conv6 = nn.Conv2d(1024, 10, kernel_size=1, stride=1, padding=0)

def forward(self,x):
  ...
  x = self.conv6(x)

# ---------------- #
Conv2d-33             [-1, 10, 8, 8]          10,250

이 두 경우의 weight를 비교해보았더니 fc layer를 사용할 경우 약 65.5만개의 weight가 존재했지만, 1x1 conv를 통해 차원을 축소했더니 1만개에 불과했다.

(kernel size x kernel size x in_channels) x num_kernels + 1 x num_kernels 를 통해 직접 파라미터의 수를 구해보면, (1 x 1 x 1024 x 10 + 1 x 10) = 10,250개에 불과하므로 약 65배가 차이나는 것을 확인할 수 있다.

inception module

또 하나 googLeNet의 핵심은 인셉션 모듈이라는 다양한 특징을 추출하기 위해 NIN 구조를 확장한 복수의 병렬적인 컨볼루션 층을 사용했다. NIN 구조에서는 기존의 컨볼루션 연산을 MLPConv 연산으로 대체하여 커널 대신 비선형 함수를 활성함수로 포함하는 MLP를 사용하여 특징을 추출한다.

a가 기존의 방법으로, 하나의 receptive field를 커널과 연산하여 output을 내는 방식이었다. 그러나 MLPconv 층의 경우에는 한 receptive field를 MLP의 입력으로 넣어 연산을 한다. 이렇게 하면 해당하는 위치에 대해 완전 연결, FC layer를 사용하므로 특징을 조금 더 잘 추출할 수도 있고, receptive field안의 또 다른 다양한 receptive field 크기를 만들 수 있게 된다.

NIN의 또다른 특징으로는 전역 평균 풀링을 사용했다. MLPconv layer에서의 특징맵 채널을 분류하고자 하는 클래스의 수만큼으로 만들고, 각각의 클래스가 1개의 채널만을 가리킬 수 있도록, 즉 1개의 채널마다의 평균을 구해서 확률값으로 넣어서 모든 채널을 평균내면 클래스의 수만큼의 길이로 output이 생성된다.

예를 들어, 이와 같이 3채널일 때, 이를 각각의 채널을 평균 내서 각각의 클래스에 해당되도록 만든다. 그러면 최종 output은 [batch, n_classes, 1, 1]의 shape을 가지게 된다. 이를 fc layer에 넣은 후 softmax를 적용하여 확률값으로 만들어 출력한다. 이와 같은 방법의 가장 큰 장점은 전역 평균 풀링으로 만들어진 feature map에서 평균내는 것에는 매개변수가 존재하지 않는다는 것이다. 이전의 방법인 VGGNet에서는 3개의 fc layer를 사용했고, 이에 대한 매개변수는 1억2천2백만 개의 매개변수를 가지는데, 그에 반해 전역 평균 풀링과 1개의 fc layer를 사용하면 1억개의 매개변수가 사라지게 된다.

이러한 MLPconv 개념을 googLeNet에 활용한 방법이 인셉션 모듈이다. MLP 대신 네 종류의 컨볼루션 연산을 사용하여 다양한 특징을 추출한다.

또한, 그림을 보면, a처럼 1x1,3x3,5x5,7x7… 을 하게되면 계속 연산량이 증가하게 된다. 그래서 b처럼 1x1 conv를 사용해서 차원을 줄여 연산량을 줄인 후 연산을 한다. 마지막에는 3x3 maxpooling을 추가했다. 이렇게 구해진 4개 각각의 feature map을 concat하기 위해서는 차원을 맞춰줘야 한다. 그를 위해 maxpooling에도 1x1 convolution을 하여 차원을 변경한다.

그래서 googLeNet은 inception 모듈을 9개 결합했다. 그리고 예전에는 학습이 잘 진행되지 않아서 보조 분류기를 사용해서 역전파의 결과를 결합해서 경사 소멸 문제를 완화했다. 학습할 때 도움을 주고, 추론할 때는 제거한다.

ResNet

ResNet은 residual block을 사용했다. 층이 깊을수록 좋은 것은 확인했으나 실제 20개의 layer와 50개의 layer를 사용했을 때 학습을 진행해보면, 20layer가 오히려 loss가 더 낮게 나오는 경향이 발생했다. 이러한 이유는 최적화가 잘 되지 않았다는 것이라 생각했고, 이를 해결하는 방법으로 residual learning을 생각했다.

residual block에 사용되는 residual learning의 특징으로는 원래의 학습이 입력 x를 넣으면 conv-relu-conv 를 거쳐 출력 H(x)가 된다고 할 때, 입력 x에서 H(x)가 되기 위한 변화량을 F(x)라 하여 H(x) = F(x) + x로 식을 새로 만들었다. 이 때 F(x)를 잔류(잔차)(residual)이라 한다.

https://github.com/pytorch/vision/blob/main/torchvision/models/resnet.py

[데브코스] 11주차 - DeepLearning Many CNN models structure

AlexNet

VGGNet16

GoogLeNet

ResNet

Further Reading

[데브코스] 9주차 - DeepLearning Numpy & Matplotlib

[데브코스] 9주차 - DeepLearning AI & machine learning

[데브코스] 9주차 - DeepLearning Mathematics for Machine Learning