👩‍💻 Lab 3

NAS
TinyML
lab
Neural Architecture Search (NAS) Experiment
Author

Seunghyun Oh

Published

March 16, 2024

์ด๋ฒˆ ์‹œ๊ฐ„์€ Neural Architecture Search(NAS)์—์„œ ์‹ค์Šต์„ ํ•ด๋ณด๋Š” ์‹œ๊ฐ„์ด์˜€์–ด์š”. ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ๊ฐ€์ง€๊ณ  ๋„คํŠธ์›Œํฌ๋ฅผ ๋” ๊นŠ๊ฒŒ ๋งŒ๋“ค๊ฑฐ๋‚˜, ์ฑ„๋„์„ ๋” ํฌ๊ฒŒ ๋งŒ๋“œ๋Š” ๋ฐฉ๋ฒ•์— ๋Œ€ํ•ด ์‹ค์ œ๋กœ ์ฝ”๋“œ ์˜ˆ์‹œ๊ฐ€ ์นœ์ ˆํ•˜๊ฒŒ ๋ผ ์žˆ์–ด, ์‹คํ—˜๊ฒฐ๊ณผ๋ฅผ ์ž์„ธํžˆ ๋ณด๊ธฐ ์ข‹์•˜๋˜ ์˜ˆ์ œ์ž…๋‹ˆ๋‹ค. ์˜์–ด๋กœ ๋œ ์„ค๋ช…์€ NAS ๊ฐ•์˜์— ๋‚˜์˜ค๋Š” ์ž๋ฃŒ๋ผ ๊ผญ ์ฝ์œผ์‹ค ํ•„์š”๋Š” ์—†์–ด์š”. ๊ทธ๋ฆฌ๊ณ  ์ค‘๊ฐ„์ค‘๊ฐ„์— ์ดํ•ด๋ฅผ ๋•๊ธฐ ์œ„ํ•œ ๋‹ค์ด์–ด๊ทธ๋žจ์ด๋‚˜ ์„ค๋ช…์ด Getting Started ๋ถ€๋ถ„์— ์žˆ์–ด์„œ ์ฐธ๊ณ ํ•˜์‹œ๋ฉด ์ข‹์„ ๊ฒƒ ๊ฐ™์•„์š”.

Let's get started!

Introduction

์ฒ˜์Œ์—๋Š” ์—ฌ๋Ÿฌ ์—ฐ๊ตฌ๋“ค๊ณผ ์—ฐ๊ตฌ์— ํ•ด๋‹นํ•˜๋Š” ๋ชจ๋ธ์„ ์–ธ๊ธ‰ํ•ฉ๋‹ˆ๋‹ค. ์ €ํฌ๊ฐ€ ์˜ค๋Š˜ ์‹ค์Šตํ•  ๋ชจ๋ธ์€ Once for All(OFA) MCUNet ์ด๋‹ˆ ์ฐธ๊ณ ํ•ด์ฃผ์„ธ์š”.

With a pretrained OFA MCUNet, as in the figure below, we will take "subsets" of the model with adjusted parameters, such as a reduced number of channels or a tuned number of layers, and evaluate their memory usage and compute (MACs). Then we search for the model whose MACs and peak memory match what we want.

How do we find a model that meets the constraints? By building an accuracy predictor. We collect data on model architectures and their accuracies from OFA MCUNet, then fit a model to that data: feed in the architecture parameters, and out comes the predicted accuracy. Finally, using the accuracy predictor, we gather random samples of architecture parameters and pick a model that satisfies the constraints we want. The point of introducing NAS in the lecture is exactly this: "to fit a large model onto a small device, find a sub-network that matches the desired spec and deploy it." As examples of such small devices, the figure below shows an MCU, Alexa, and Google Home.
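Conceptually, the whole pipeline boils down to a loop like the following. This is a minimal sketch using the helpers defined later in this lab (sample_active_subnet, get_efficiency, satisfy_constraint, predict_acc); the glue code and sampling strategy here are my own assumptions, not the lab's actual search code.

def random_search(ofa_network, efficiency_predictor, accuracy_predictor,
                  constraint, n_samples=1000):
    best_cfg, best_acc = None, -1.0
    for _ in range(n_samples):
        # 1. sample a random subnet configuration from the design space
        cfg = ofa_network.sample_active_subnet(
            sample_function=random.choice, image_size=random.choice([96, 128, 160])
        )
        # 2. discard candidates that violate the MAC / peak-memory budget
        measured = efficiency_predictor.get_efficiency(cfg)
        if not efficiency_predictor.satisfy_constraint(measured, constraint):
            continue
        # 3. rank the survivors with the cheap accuracy predictor
        acc = accuracy_predictor.predict_acc([cfg]).item()
        if acc > best_acc:
            best_cfg, best_acc = cfg, acc
    return best_cfg, best_acc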

But the tight memory budget (50,000x smaller than GPUs) makes deep learning deployment difficult.

There are 2 main sections: accuracy & efficiency predictors and architecture search.

  • For predictors, there are 4 questions in total. There is one question (5 pts) in the Getting Started section and the other three questions (30 pts) are in the Predictors section.
  • For architecture search, there are 6 questions in total.

์ด์ œ ๊ฐ์„คํ•˜๊ณ  ํ•˜๋‚˜์”ฉ ์‹คํ—˜ํ•ด๋ณผ๊ฒŒ์š”! ํŒจํ‚ค์ง€๋Š” ์•„๋ž˜์™€ ๊ฐ™์ด ์„ค์น˜ํ•˜์‹œ๋ฉด ๋ฉ๋‹ˆ๋‹ค.

First, install the required packages and download the Visual Wake Words dataset that will be used in this lab.

# print("Cleanning up workspace ...")
# # !rm -rf *
# print("Installing graphviz ...")
# # !sudo apt-get install graphviz 1>/dev/null
# print("Downloading MCUNet codebase ...")
# !wget https://www.dropbox.com/s/3y2n2u3mfxczwcb/mcunetv2-dev-main.zip?dl=0 >/dev/null
# !unzip mcunetv2-dev-main.zip* 1>/dev/null
# !mv mcunetv2-dev-main/* . 1>/dev/null
# print("Downloading VWW dataset ...")
# !wget https://www.dropbox.com/s/169okcuuv64d4nn/data.zip?dl=0 >/dev/null
# print("Unzipping VWW dataset ...")
# !unzip data.zip* 1>/dev/null
# print("Installing thop and onnx ...")
# !pip install thop 1>/dev/null
# !pip install onnx 1>/dev/null
import argparse
import json
from PIL import Image
from tqdm import tqdm
import copy
import math
import numpy as np
import os
import random
import torch
from torch import nn
from torchvision import datasets, transforms
from mcunet.tinynas.search.accuracy_predictor import (
    AccuracyDataset,
    MCUNetArchEncoder,
)

from mcunet.tinynas.elastic_nn.networks.ofa_mcunets import OFAMCUNets
from mcunet.utils.mcunet_eval_helper import calib_bn, validate
from mcunet.utils.arch_visualization_helper import draw_arch


%matplotlib inline
from matplotlib import pyplot as plt
import warnings
warnings.filterwarnings('ignore')

Getting Started: Super Network and the VWW dataset

์‹คํ—˜์—์„œ๋Š” ์ด๋ฏธ ํ›ˆ๋ จํ•œ MCUNetV2 super network ๋ฅผ ๊ฐ€์ ธ์˜ต๋‹ˆ๋‹ค. ๋ฐ์ดํ„ฐ์…‹์„ ๊ฐ€์ ธ์˜ค๋Š” ๋ถ€๋ถ„, OFA MCUNet ํด๋ž˜์Šค๋ฅผ ๊ฐ€์ ธ์˜ค๋Š” ๋ถ€๋ถ„, ๊ทธ๋ฆฌ๊ณ  Sub-network๋ฅผ ๊ฐ€์ ธ์™€ ๋ชจ๋ธ ๊ตฌ์กฐ๋ฅผ ์‹œ๊ฐํ™”ํ•˜๊ณ  ์ €ํฌ๊ฐ€ ์›ํ•˜๋Š” Constraint(๋ฉ”๋ชจ๋ฆฌ, ์—ฐ์‚ฐ์†๋„)์— ๋งž๋Š” ๋ชจ๋ธ์„ ์ฐพ๋Š” ์ฝ”๋“œ ์˜ˆ์ œ์ž…๋‹ˆ๋‹ค.

  • MCUNetV2 is a family of efficient neural networks tailored for resource-constrained microcontrollers. It utilizes patch-based inference, receptive field redistribution, and system-NN co-design, and greatly improves the accuracy-efficiency tradeoff of MCUNet.
def build_val_data_loader(data_dir, resolution, batch_size=128, split=0):
    # split = 0: real val set, split = 1: holdout validation set
    assert split in [0, 1]
    normalize = transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
    kwargs = {"num_workers": min(8, os.cpu_count()), "pin_memory": False}

    val_transform = transforms.Compose(
        [
            transforms.Resize(
                (resolution, resolution)
            ),  # if center crop, the person might be excluded
            transforms.ToTensor(),
            normalize,
        ]
    )
    val_dataset = datasets.ImageFolder(data_dir, transform=val_transform)

    val_dataset = torch.utils.data.Subset(
        val_dataset, list(range(len(val_dataset)))[split::2]
    )
        
    val_loader = torch.utils.data.DataLoader(
        val_dataset, batch_size=batch_size, shuffle=False, **kwargs
    )
    return val_loader
data_dir = "data/vww-s256/val"

val_data_loader = build_val_data_loader(data_dir, resolution=128, batch_size=1)

vis_x, vis_y = 2, 3
fig, axs = plt.subplots(vis_x, vis_y)

num_images = 0
for data, label in val_data_loader:
    img = np.array((((data + 1) / 2) * 255).numpy(), dtype=np.uint8)
    img = img[0].transpose(1, 2, 0)
    if label.item() == 0:
        label_text = "No person"
    else:
        label_text = "Person"
    axs[num_images // vis_y][num_images % vis_y].imshow(img)
    axs[num_images // vis_y][num_images % vis_y].set_title(f"Label: {label_text}")
    axs[num_images // vis_y][num_images % vis_y].set_xticks([])
    axs[num_images // vis_y][num_images % vis_y].set_yticks([])
    num_images += 1
    if num_images > vis_x * vis_y - 1:
        break

plt.show()

Here they say the design space of OFA MCUNet is \(>10^{19}\) architectures. Staggering, right? Subnets are built from inverted MobileNet blocks, and the parameters that change the architecture are the kernel sizes (3, 5, 7), expand ratios (3, 4, 6), depth, and global channel scaling (0.5x, 0.75x, 1.0x, specified by width_mult_list). More detailed explanations will follow as we go.
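As a quick sanity check on that \(>10^{19}\) figure: with 3 kernel sizes and 3 expand ratios per block and roughly 20 searchable blocks (my assumed count), the per-block choices alone already exceed \(10^{19}\), before even multiplying in the depth, width, and resolution options.

# Back-of-envelope estimate of the design space size (block count assumed).
num_blocks = 20            # roughly the number of searchable MBConv blocks
choices_per_block = 3 * 3  # 3 kernel sizes x 3 expand ratios
print(f"{choices_per_block ** num_blocks:.2e}")  # ~1.22e+19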

device = "cuda:0"
ofa_network = OFAMCUNets(
    n_classes=2,
    bn_param=(0.1, 1e-3),
    dropout_rate=0.0,
    base_stage_width="mcunet384",
    width_mult_list=[0.5, 0.75, 1.0],
    ks_list=[3, 5, 7],
    expand_ratio_list=[3, 4, 6],
    depth_list=[0, 1, 2],
    base_depth=[1, 2, 2, 2, 2],
    fuse_blk1=True,
    se_stages=[False, [False, True, True, True], True, True, True, False],
)

ofa_network.load_state_dict(
    torch.load("vww_supernet.pth", map_location="cpu")["state_dict"], strict=True
)

ofa_network = ofa_network.to(device)
from mcunet.utils.pytorch_utils import count_peak_activation_size, count_net_flops, count_parameters

def evaluate_sub_network(ofa_network, cfg, image_size=None):
    if "image_size" in cfg:
        image_size = cfg["image_size"]
    batch_size = 128
    # step 1. sample the active subnet with the given config.
    ofa_network.set_active_subnet(**cfg)
    # step 2. extract the subnet with corresponding weights.
    subnet = ofa_network.get_active_subnet().to(device)
    # step 3. calculate the efficiency stats of the subnet.
    peak_memory = count_peak_activation_size(subnet, (1, 3, image_size, image_size))
    macs = count_net_flops(subnet, (1, 3, image_size, image_size))
    params = count_parameters(subnet)
    # step 4. perform BN parameter re-calibration.
    calib_bn(subnet, data_dir, batch_size, image_size)
    # step 5. define the validation dataloader.
    val_loader = build_val_data_loader(data_dir, image_size, batch_size)
    # step 6. validate the accuracy.
    acc = validate(subnet, val_loader)
    return acc, peak_memory, macs, params

We also provide a handy helper function to visualize the architecture of the subnets. The function takes in the configuration of the subnet and returns an image representing the architecture.

def visualize_subnet(cfg):
    draw_arch(cfg["ks"], cfg["e"], cfg["d"], cfg["image_size"], out_name="viz/subnet")
    im = Image.open("viz/subnet.png")
    im = im.rotate(90, expand=1)
    fig = plt.figure(figsize=(im.size[0] / 250, im.size[1] / 250))
    plt.axis("off")
    plt.imshow(im)
    plt.show()

We will use the code above to visualize model architectures, and you will see names like MBConv3-3x3. Each block is labeled MBConv{e}-{k}x{k}, where e is the expand ratio and k is the kernel size of the depthwise convolution layer, so keep that in mind.
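For reference, a sampled configuration is just a plain dict. The shape below is illustrative (values made up, list lengths depend on the supernet); the keys ks, e, d, and image_size are the ones visualize_subnet reads.

# Illustrative subnet config (values are made up for demonstration).
cfg = {
    "ks": [5, 3, 7, 3, 5, 7],  # depthwise kernel size per block
    "e":  [4, 6, 3, 3, 4, 6],  # expand ratio per block
    "d":  [1, 2, 0, 2, 1],     # extra depth per stage, on top of base_depth
    "image_size": 128,
}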

More Explanation to understand OFA-MCUNet

Before getting into the assignment, let me explain OFA-MCUNet a little. The architecture parameters keep appearing as we go, and knowing what each one means will make things much easier to follow.

๋ชจ๋ธ์€ ์ด first_conv, blocks, feature_mix_layer, classifier ์œผ๋กœ ๊ตฌ์„ฑํ•ด์š”. block์—์„œ๋„ ์ฒซ ๋ฒˆ์งธ, ๋งˆ์ง€๋ง‰ block์„ ์ œ์™ธํ•œ ์ด 6๊ฐœ์˜ block์—์„œ kernel size, expand ratio, depth, width multiply๋ฅผ ํŒŒ๋ผ๋ฏธํ„ฐ๋กœ ํ•ด์„œ ๋ชจ๋ธ์„ ํ‚ค์šฐ๊ฑฐ๋‚˜, ์ค„์ด์ฃ . ๊ฐ๊ฐ์˜ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์ข€ ๋” ์‚ดํŽด๋ณด์ฃ !

1. Kernel size

The kernel size is exactly the kernel that appears in a convolution. In this example it can be 3x3, 5x5, or 7x7.

2. Width multiplier, Depth

OFA MCUNet์„ ๋ธ”๋Ÿญ์œผ๋กœ ํ‘œํ˜„ํ•˜๋ฉด ์•„๋ž˜์™€ ๊ฐ™์ฃ . ๊ทธ์ค‘ ์ดˆ๋ก์ƒ‰์œผ๋กœ ์น ํ•ด์ง„ Block์„ ๋ณด์‹œ๋ฉด, Block์œผ๋กœ ๋“ค์–ด์˜ค๋Š” Input Channel๊ณผ Output Channel์ด ์žˆ์–ด์š”. ๋ฐ”๋กœ ๊ทธ ๋‘˜์„ ์–ผ๋งˆ๋‚˜ ์ค„์ผ ๊ฒƒ์ธ๊ฐ€, ์œ ์ง€ํ•  ๊ฒƒ์ธ๊ฐ€๊ฐ€ Width multiply์ž…๋‹ˆ๋‹ค.

Second, a block is composed of MBConv (MobileNet Conv) layers. The key question is how many MBConv layers go into each block, and that is what depth decides. In the parameters this is split into depth_list and base_depth: each block starts from base_depth and adds as many extra MBConv layers as the value sampled from depth_list.
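For instance, with base_depth=[1, 2, 2, 2, 2] and extra depths sampled from depth_list=[0, 1, 2], the layer count per stage works out like this (a small illustration of the rule above; the sampled values are arbitrary):

# MBConv layers per stage = base_depth[i] + sampled extra depth d_i.
base_depth = [1, 2, 2, 2, 2]
sampled_d = [2, 0, 1, 2, 0]  # one value from depth_list per stage (arbitrary)
print([b + d for b, d in zip(base_depth, sampled_d)])  # [3, 2, 3, 4, 2]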

๋งˆ์ง€๋ง‰์€ expand ratio ์ž…๋‹ˆ๋‹ค. ์ด ํŒŒ๋ผ๋ฏธํ„ฐ๋Š” MBConv ๋‚ด์—์„œ ์žˆ์–ด์š”, ์—ญ์‹œ๋‚˜ ๊ทธ๋ฆผ์„ ๋ณด์‹œ์ฃ . MBConv๋Š” MobileNet Convolution, Separable Convolution,

SE-Block, ๊ทธ๋ฆฌ๊ณ  ๋‹ค์‹œ MobileNet Convolution์œผ๋กœ ๊ตฌ์„ฑ๋˜์š”. ๊ทธ ์ค‘, ์ฒ˜์Œ ์ž…๋ ฅ์˜ ์ฑ„๋„๊ณผ ์ฒซ MobileNet Convolution์„ ๊ฑฐ์น˜๊ณ  ๋‚˜์˜จ ์ถœ๋ ฅ ์ฑ„๋„์˜ ๋น„๋ฅผ Expand ratio๋ผ๊ณ  ํ•ฉ๋‹ˆ๋‹ค.
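To make the expand ratio concrete, here is a generic inverted-bottleneck sketch in PyTorch. It is simpler than the lab's dynamic MBConv layer (no SE block, no elastic shapes), but it shows where the ratio acts: the 1x1 expansion that widens the channels before the depthwise convolution.

import torch.nn as nn

class TinyMBConv(nn.Module):
    # Minimal inverted bottleneck: 1x1 expand -> depthwise -> 1x1 project.
    def __init__(self, in_ch, out_ch, kernel_size=3, expand_ratio=4, stride=1):
        super().__init__()
        mid_ch = in_ch * expand_ratio  # the ratio the search space varies
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),  # 1x1 channel expansion
            nn.BatchNorm2d(mid_ch),
            nn.ReLU6(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, kernel_size, stride,
                      kernel_size // 2, groups=mid_ch, bias=False),  # depthwise
            nn.BatchNorm2d(mid_ch),
            nn.ReLU6(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),  # 1x1 projection
            nn.BatchNorm2d(out_ch),
        )
        self.use_res = stride == 1 and in_ch == out_ch

    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_res else y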

# OFAMCUNets
# consists of: first_conv, blocks, feature_mix_layer, classifier
# total 9 blocks (first_conv, first block, blocks, last block)
# 1. first_conv = 1x1 channel inc conv (3 -> X)
# 2. first block = MBInvertedConvLayer

# 3. blocks
# - depth = num blocks
# - 1 block = MobileInvertedResidualBlock = MBConvLayer + Residual
#############################################################
# Dynamic MBConvLayer = 2 times channel expansion           #
#                  fuse_blk1    se_stage                    #
# MBConvLayer + SeparableConv + SEBlock + MBConvLayer       #
#############################################################
# SEBlock: conv 1x1 (reduce) -> act -> conv 1x1 (expand) -> h_sigmoid
# -> SENet (Squeeze-and-Excitation Network)

# 4. Last block = Mobile Inverted Residual Block
# 5. feature_mix_layer = 1x1 channel dec conv
# 6. classifier = linear layer

# Parameters (sample_active_subnet)
# kernel size, expand ratio, depth, width multiply

There is a method in the code called make_divisible. It ensures that when channels are scaled up or down, the resulting count is divisible by 8. Apparently TensorFlow uses it too; I honestly don't know the reason yet!

def make_divisible(v, divisor, min_val=None):
    """
    This function is taken from the original tf repo.
    It ensures that all layers have a channel number that is divisible by 8
    It can be seen here:
    https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet/mobilenet.py
    :param v:
    :param divisor:
    :param min_val:
    :return:
    """
    if min_val is None:
        min_val = divisor
    new_v = max(min_val, int(v + divisor / 2) // divisor * divisor)
    # Make sure that round down does not go down by more than 10%.
    if new_v < 0.9 * v:
        new_v += divisor
    return new_v
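A few quick calls show the behavior; the usual rationale (my understanding, not stated in the lab) is that channel counts aligned to multiples of 8 map better onto SIMD lanes and memory layouts on most hardware.

# Channel scaling with a width multiplier, rounded to a multiple of 8.
print(make_divisible(96 * 0.75, 8))  # 72 (already a multiple of 8)
print(make_divisible(52 * 0.5, 8))   # 24 (26 rounds down to the nearest 8)
print(make_divisible(10 * 0.5, 8))   # 8  (min_val keeps at least 8 channels)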

TL;DR. Summary

์‹คํ—˜์€ ์ด 4 ๋‹จ๊ณ„๋กœ ๋‚˜๋‰ฉ๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์„œ ํŒŒ๋ผ๋ฏธํ„ฐ๋Š” kernel size, expand ratio, depth, width multiply์ฃ .

1. OFA-MCUNet

์ฒ˜์Œ์€ ํ›ˆ๋ จ๋œ vww_supernet์„ ๊ฐ€์ง€๊ณ  ํŒŒ๋ผ๋ฏธํ„ฐ๋งˆ๋‹ค accuracy ์กฐํ•ฉ์„ ๊ตฌํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ๊ฐ ๊ฒฐ๊ณผ๋งˆ๋‹ค ์ดํ›„์— constraint ๋ฒ”์œ„ ๋‚ด์— ๋“ค์–ด์˜ค๋Š” ๋ชจ๋ธ๊ตฌ์กฐ๋ฅผ ์ฐพ๊ธฐ ์œ„ํ•ด MAC๊ณผ Peak memory ๋˜ํ•œ ๊ตฌํ•  ๊ฒ๋‹ˆ๋‹ค.

2. Accuracy Predictor

With the accuracy obtained for each parameter combination, we now go the other way and train a model on this data to predict accuracy. The model is a simple stack of three linear layers. An encoder goes in front, though, to turn a parameter combination into an embedding vector.

3. Encoding: MCUNetArchEncoder

๊ทธ ๊ณผ์ •์—์„œ ํŒŒ๋ผ๋ฏธํ„ฐ ์กฐํ•ฉ์„ Embedding vector๋กœ Encoding์„ ํ•ฉ๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, Kenral size๊ฐ€ 3x3, 5x5, 7x7 ์ด ์žˆ๋Š” ๊ฒฝ์šฐ ๊ฐ๊ฐ์„ (0, 0, 1), (0, 1, 0), (1, 0, 0) ์ด๋ ‡๊ฒŒ encoding ํ•˜๋Š” ๊ฑฐ์ฃ . ์ด encoding์ด ๋“ค์–ด๊ฐ„ Accuracy Predictor ๋ชจ๋ธ์„ ํ›ˆ๋ จ์‹œํ‚ต๋‹ˆ๋‹ค. ํ›ˆ๋ จ์‹œํ‚จ ๋ชจ๋ธ์˜ Prediction๊ณผ Label ๊ฐ„์˜ ์ƒ๊ด€๊ด€๊ณ„๊ฐ€ Linearํ•˜๊ฒŒ ๋‚˜์˜ค๋Š” ๊ฒƒ ๋˜ํ•œ ๋ณด์—ฌ์ค„ ๊ฒ๋‹ˆ๋‹ค.

OFA_network's forward

๋ชจ๋ธ ์‹คํ—˜ํ•˜๊ธฐ์— ์•ž์„œ์„œ, ์ฑ„๋„์„ ๋งŒ์•ฝ ์ค„์ธ๋‹ค๋ฉด ์–ด๋–ค์‹์œผ๋กœ ํ• ์ง€ Convolution Network์—์„œ ๋‚˜์˜จ ์ฝ”๋“œ๋ฅผ ๊ฐ€์ ธ์™€๋ดค์–ด์š”. ํŒŒ๋ผ๋ฏธํ„ฐ์— ๋งž๊ฒŒ ๊ฒฐ์ •ํ•œ out_channel, in_channel์„ ์•„๋ž˜ ์ฝ”๋“œ ์ฒ˜๋Ÿผ ์ž˜๋ผ active subnet์ด๋ผ๊ณ  ๋ถ€๋ฅผ ๊ฑฐ์—์š”. ์‹คํ—˜์€ ์ œ๊ฐ€ ์ž„์˜๋กœ ์ด๋ฏธ์ง€ ์‚ฌ์ด์ฆˆ๋ฅผ 48, 96, 128, 256, 384, 512๋กœ ํ‚ค์›Œ๋‚˜๊ฐ€๋ฉด์„œ ํ–ˆ๊ณ , sub network๋กœ ์ƒ˜ํ”Œ๋งํ•˜๋Š” ๋ฐฉ๋ฒ•์œผ๋กœ๋Š” random, max, min์œผ๋กœ ํ–ˆ์Šต๋‹ˆ๋‹ค.

filters = self.conv.weight[:out_channel, :in_channel, :, :].contiguous()
padding = get_same_padding(self.kernel_size)
y = F.conv2d(x, filters, None, self.stride, padding, self.dilation, 1)

ํฅ๋ฏธ๋กœ์› ๋˜ ๊ฑด ์ด๋ฏธ์ง€๊ฐ€ ์ปค์ง€๋ฉด ์ปค์งˆ์ˆ˜๋ก Accuracy๋Š” ๊ณ„์† ์˜ฌ๋ผ๊ฐ€๋‹ค๊ฐ€ 512์—์„œ ๋ถ€ํ„ฐ ๋–จ์–ด์ง€๋”๋ผ๊ตฌ์š”. ์‹คํ—˜๊ฒฐ๊ณผ๋Š” ์•„๋ž˜๋ฅผ ์ฐธ๊ณ ๋ฐ”๋ž๋‹ˆ๋‹ค.

# sample_active_subnet
# kernel size, expand ratio, depth, width mult

image_size = 48

cfg = ofa_network.sample_active_subnet(sample_function=random.choice, image_size=image_size)
acc, _, _, params = evaluate_sub_network(ofa_network, cfg)
visualize_subnet(cfg)
print(f"The accuracy of the sampled subnet: #params={params/1e6: .1f}M, accuracy={acc: .1f}%.")

largest_cfg = ofa_network.sample_active_subnet(sample_function=max, image_size=image_size)
acc, _, _, params = evaluate_sub_network(ofa_network, largest_cfg)
visualize_subnet(largest_cfg)
print(f"The largest subnet: #params={params/1e6: .1f}M, accuracy={acc: .1f}%.")

smallest_cfg = ofa_network.sample_active_subnet(sample_function=min, image_size=image_size)
acc, peak_memory, macs, params = evaluate_sub_network(ofa_network, smallest_cfg)
visualize_subnet(smallest_cfg)
print(f"The smallest subnet: #params={params/1e6: .1f}M, accuracy={acc: .1f}%.")
Validate: 100%|██████████| 32/32 [00:00<00:00, 51.09it/s, loss=0.603, top1=65.9]
Validate: 100%|██████████| 32/32 [00:00<00:00, 53.97it/s, loss=0.625, top1=64.2]
Validate: 100%|██████████| 32/32 [00:00<00:00, 51.76it/s, loss=0.718, top1=59.3]

The accuracy of the sampled subnet: #params= 1.6M, accuracy= 65.9%.
The largest subnet: #params= 2.5M, accuracy= 64.2%.
The smallest subnet: #params= 0.3M, accuracy= 59.3%.

image_size = 96

cfg = ofa_network.sample_active_subnet(sample_function=random.choice, image_size=image_size)
acc, _, _, params = evaluate_sub_network(ofa_network, cfg)
visualize_subnet(cfg)
print(f"The accuracy of the sampled subnet: #params={params/1e6: .1f}M, accuracy={acc: .1f}%.")

largest_cfg = ofa_network.sample_active_subnet(sample_function=max, image_size=image_size)
acc, _, _, params = evaluate_sub_network(ofa_network, largest_cfg)
visualize_subnet(largest_cfg)
print(f"The largest subnet: #params={params/1e6: .1f}M, accuracy={acc: .1f}%.")

smallest_cfg = ofa_network.sample_active_subnet(sample_function=min, image_size=image_size)
acc, peak_memory, macs, params = evaluate_sub_network(ofa_network, smallest_cfg)
visualize_subnet(smallest_cfg)
print(f"The smallest subnet: #params={params/1e6: .1f}M, accuracy={acc: .1f}%.")
Validate: 100%|██████████| 32/32 [00:00<00:00, 35.68it/s, loss=0.321, top1=86.4]
Validate: 100%|██████████| 32/32 [00:00<00:00, 42.76it/s, loss=0.29, top1=88.6]
Validate: 100%|██████████| 32/32 [00:00<00:00, 44.92it/s, loss=0.379, top1=83.4]

The accuracy of the sampled subnet: #params= 0.6M, accuracy= 86.4%.
The largest subnet: #params= 2.5M, accuracy= 88.6%.
The smallest subnet: #params= 0.3M, accuracy= 83.4%.

image_size = 128

# sample_active_subnet
# kernel size, expand ratio, depth, width mult

cfg = ofa_network.sample_active_subnet(sample_function=random.choice, image_size=image_size)
acc, _, _, params = evaluate_sub_network(ofa_network, cfg)
visualize_subnet(cfg)
print(f"The accuracy of the sampled subnet: #params={params/1e6: .1f}M, accuracy={acc: .1f}%.")

largest_cfg = ofa_network.sample_active_subnet(sample_function=max, image_size=image_size)
acc, _, _, params = evaluate_sub_network(ofa_network, largest_cfg)
visualize_subnet(largest_cfg)
print(f"The largest subnet: #params={params/1e6: .1f}M, accuracy={acc: .1f}%.")

smallest_cfg = ofa_network.sample_active_subnet(sample_function=min, image_size=image_size)
acc, peak_memory, macs, params = evaluate_sub_network(ofa_network, smallest_cfg)
visualize_subnet(smallest_cfg)
print(f"The smallest subnet: #params={params/1e6: .1f}M, accuracy={acc: .1f}%.")
Validate: 100%|██████████| 32/32 [00:00<00:00, 39.53it/s, loss=0.228, top1=91.3]
Validate: 100%|██████████| 32/32 [00:01<00:00, 30.92it/s, loss=0.21, top1=92.3]
Validate: 100%|██████████| 32/32 [00:00<00:00, 40.69it/s, loss=0.307, top1=87.3]

The accuracy of the sampled subnet: #params= 1.3M, accuracy= 91.3%.
The largest subnet: #params= 2.5M, accuracy= 92.3%.
The smallest subnet: #params= 0.3M, accuracy= 87.3%.

image_size = 256

# sample_active_subnet
# kernel size, expand ratio, depth, width mult

cfg = ofa_network.sample_active_subnet(sample_function=random.choice, image_size=image_size)
acc, _, _, params = evaluate_sub_network(ofa_network, cfg)
visualize_subnet(cfg)
print(f"The accuracy of the sampled subnet: #params={params/1e6: .1f}M, accuracy={acc: .1f}%.")

largest_cfg = ofa_network.sample_active_subnet(sample_function=max, image_size=image_size)
acc, _, _, params = evaluate_sub_network(ofa_network, largest_cfg)
visualize_subnet(largest_cfg)
print(f"The largest subnet: #params={params/1e6: .1f}M, accuracy={acc: .1f}%.")

smallest_cfg = ofa_network.sample_active_subnet(sample_function=min, image_size=image_size)
acc, peak_memory, macs, params = evaluate_sub_network(ofa_network, smallest_cfg)
visualize_subnet(smallest_cfg)
print(f"The smallest subnet: #params={params/1e6: .1f}M, accuracy={acc: .1f}%.")
Validate: 100%|██████████| 32/32 [00:01<00:00, 19.93it/s, loss=0.187, top1=93.5]
Validate: 100%|██████████| 32/32 [00:03<00:00, 10.12it/s, loss=0.177, top1=93.9]
Validate: 100%|██████████| 32/32 [00:01<00:00, 25.67it/s, loss=0.258, top1=90.2]

The accuracy of the sampled subnet: #params= 0.6M, accuracy= 93.5%.
The largest subnet: #params= 2.5M, accuracy= 93.9%.
The smallest subnet: #params= 0.3M, accuracy= 90.2%.

image_size = 256+128

# sample_active_subnet
# kernel size, expand ratio, depth, width mult

cfg = ofa_network.sample_active_subnet(sample_function=random.choice, image_size=image_size)
acc, _, _, params = evaluate_sub_network(ofa_network, cfg)
visualize_subnet(cfg)
print(f"The accuracy of the sampled subnet: #params={params/1e6: .1f}M, accuracy={acc: .1f}%.")

largest_cfg = ofa_network.sample_active_subnet(sample_function=max, image_size=image_size)
acc, _, _, params = evaluate_sub_network(ofa_network, largest_cfg)
visualize_subnet(largest_cfg)
print(f"The largest subnet: #params={params/1e6: .1f}M, accuracy={acc: .1f}%.")

smallest_cfg = ofa_network.sample_active_subnet(sample_function=min, image_size=image_size)
acc, peak_memory, macs, params = evaluate_sub_network(ofa_network, smallest_cfg)
visualize_subnet(smallest_cfg)
print(f"The smallest subnet: #params={params/1e6: .1f}M, accuracy={acc: .1f}%.")
Validate: 100%|██████████| 32/32 [00:03<00:00,  8.16it/s, loss=0.241, top1=91.1]
Validate: 100%|██████████| 32/32 [00:06<00:00,  4.60it/s, loss=0.263, top1=90.5]
Validate: 100%|██████████| 32/32 [00:02<00:00, 12.13it/s, loss=0.34, top1=85.4]

The accuracy of the sampled subnet: #params= 1.1M, accuracy= 91.1%.
The largest subnet: #params= 2.5M, accuracy= 90.5%.
The smallest subnet: #params= 0.3M, accuracy= 85.4%.

image_size = 512

# sample_active_subnet
# kernel size, expand ratio, depth, width mult

cfg = ofa_network.sample_active_subnet(sample_function=random.choice, image_size=image_size)
acc, _, _, params = evaluate_sub_network(ofa_network, cfg)
visualize_subnet(cfg)
print(f"The accuracy of the sampled subnet: #params={params/1e6: .1f}M, accuracy={acc: .1f}%.")

largest_cfg = ofa_network.sample_active_subnet(sample_function=max, image_size=image_size)
acc, _, _, params = evaluate_sub_network(ofa_network, largest_cfg)
visualize_subnet(largest_cfg)
print(f"The largest subnet: #params={params/1e6: .1f}M, accuracy={acc: .1f}%.")

smallest_cfg = ofa_network.sample_active_subnet(sample_function=min, image_size=image_size)
acc, peak_memory, macs, params = evaluate_sub_network(ofa_network, smallest_cfg)
visualize_subnet(smallest_cfg)
print(f"The smallest subnet: #params={params/1e6: .1f}M, accuracy={acc: .1f}%.")
Validate: 100%|██████████| 32/32 [00:06<00:00,  5.31it/s, loss=0.376, top1=83.1]
Validate: 100%|██████████| 32/32 [00:11<00:00,  2.67it/s, loss=0.413, top1=81]
Validate: 100%|██████████| 32/32 [00:04<00:00,  7.23it/s, loss=0.489, top1=76.1]

The accuracy of the sampled subnet: #params= 0.5M, accuracy= 83.1%.
The largest subnet: #params= 2.5M, accuracy= 81.0%.
The smallest subnet: #params= 0.3M, accuracy= 76.1%.

Question 1: Design space exploration.

Try manually sampling different subnets by running the cell above multiple times. You can also vary the input resolution. Talk about your findings.

Hint: which dimension plays the most important role for the accuracy?

Answer: Image resolution plays the most important role for classification accuracy.

๋„ค, ์งˆ๋ฌธ์—์„œ ์‚ฌ์‹ค ํžŒํŠธ๋ฅผ ์–ป์–ด ์‹คํ—˜์„ ํ–ˆ์Šต๋‹ˆ๋‹ค. โ€œImage resolution์— ๋”ฐ๋ฅธ Accuracy ๋ณ€ํ™”โ€๋ฅผ ์•Œ ์ˆ˜ ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค.

Part 1. Predictors

์ด์ œ ๋‘๋ฒˆ์งธ ๋‹จ๊ณ„๋Š” ์•ž์„œ์„œ ๋ชจ๋ธ์„ ํ†ตํ•ด ์–ป์€ VWW dataset์œผ๋กœ Accuracy๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ๋ชจ๋ธ์„ ๋งŒ๋“ค๊ฒ๋‹ˆ๋‹ค. ๋ชจ๋ธ์€ ์ƒ๊ฐ๋ชจ๋‹ค ๊ฐ„๋‹จํ•ด์š”, Linear ์„ธ ์ธต์œผ๋กœ ๊ตฌ์„ฑ๋ผ ์žˆ์ฃ . ์•„๋ž˜ ๊ทธ๋žจ์€ ๊ถ๊ทน์ ์œผ๋กœ Constraint์— ํ•ด๋‹นํ•˜๋Š” ๋ชจ๋ธ์„ ์šฐ๋ฆฌ๋Š” ๊ตฌํ• ๊ฑฐ๋‹ค, ์ด๋Ÿฐ ๋‚ด์šฉ์ž…๋‹ˆ๋‹ค.

As for the efficiency predictor, once the model architecture is fixed, its numbers come out together with accuracy. We already did this in the earlier example, so go back if you don't remember!

Question 2: Implement the efficiency predictor.

์ฒ˜์Œ์€ โ€œAnalyticalEfficiencyPredictorโ€๋ผ๋Š” ํด๋ž˜์Šค๋ฅผ ๋งŒ๋“ญ๋‹ˆ๋‹ค. ์ด๋ฏธ์ง€ ํฌ๊ธฐ์— ๋”ฐ๋ผ MAC๊ณผ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ๊ณ„์‚ฐํ•ด์ฃผ๊ณ (get_efficiency), ์ด ๋‘๊ฐ€์ง€๊ฐ€ ํƒ€๊ฒŸํ•˜๊ณ  ๋ถ€ํ•ฉํ•˜๋Š”์ง€๋„ ์•Œ๋ ค์ฃผ๋Š” ํ•จ์ˆ˜(satisfy_constraint)๋„ ๋งŒ๋“ญ๋‹ˆ๋‹ค. FLOP๊ณผ ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์‚ฐ์€ ๊ต์ˆ˜๋‹˜์ด ์นœ์ ˆํ•˜๊ฒŒ ๋งŒ๋“ค์–ด ๋†“์œผ์‹  count_net_flops๊ณผ count_peak_activation_size๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ๋ฉ๋‹ˆ๋‹ค.

class AnalyticalEfficiencyPredictor:
    def __init__(self, net):
        self.net = net

    def get_efficiency(self, spec: dict):
        self.net.set_active_subnet(**spec)
        subnet = self.net.get_active_subnet()
        if torch.cuda.is_available():
            subnet = subnet.cuda()
        ############### YOUR CODE STARTS HERE ###############
        # Hint: take a look at the `evaluate_sub_network` function above.
        # Hint: the data shape is (batch_size, input_channel, image_size, image_size)
        data_shape = (1, 3, spec["image_size"], spec["image_size"])
        macs = count_net_flops(subnet, data_shape)
        peak_memory = count_peak_activation_size(subnet, data_shape)
        ################ YOUR CODE ENDS HERE ################

        return dict(millionMACs=macs / 1e6, KBPeakMemory=peak_memory / 1024)

    def satisfy_constraint(self, measured: dict, target: dict):
        for key in measured:
            # if the constraint is not specified, we just continue
            if key not in target:
                continue
            # if we exceed the constraint, just return false.
            if measured[key] > target[key]:
                return False
        # no constraint violated, return true.
        return True

Let's test your implementation of the analytical efficiency predictor by examining the returned values for the smallest and largest subnets we evaluated a moment ago. The results from the efficiency predictor should match the previous results.

efficiency_predictor = AnalyticalEfficiencyPredictor(ofa_network)

image_size = 96
# Print out the efficiency of the smallest subnet.
smallest_cfg = ofa_network.sample_active_subnet(sample_function=min, image_size=image_size)
eff_smallest = efficiency_predictor.get_efficiency(smallest_cfg)

# Print out the efficiency of the largest subnet.
largest_cfg = ofa_network.sample_active_subnet(sample_function=max, image_size=image_size)
eff_largest = efficiency_predictor.get_efficiency(largest_cfg)

print("Efficiency stats of the smallest subnet:", eff_smallest)
print("Efficiency stats of the largest subnet:", eff_largest)
Efficiency stats of the smallest subnet: {'millionMACs': 8.302128, 'KBPeakMemory': 72.0}
Efficiency stats of the largest subnet: {'millionMACs': 79.416432, 'KBPeakMemory': 270.0}
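satisfy_constraint can then filter candidates against a budget. For example, with a hypothetical 60 million-MAC / 250 KB target, only the smallest subnet passes:

# Check both subnets against a made-up MAC / peak-memory budget.
target = {"millionMACs": 60, "KBPeakMemory": 250}
print(efficiency_predictor.satisfy_constraint(eff_smallest, target))  # True
print(efficiency_predictor.satisfy_constraint(eff_largest, target))   # False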

Question 3: Implement the accuracy predictor.

Now it's time to build the accuracy predictor, right? Before that, a look at the provided dataset shows that it consists of parameter combinations. To use them as training data we need to embed them, and that is exactly what MCUNetArchEncoder does; once again the professor kindly provides it. For the accuracy predictor itself we will use an MLP (multi-layer perceptron).

The accuracy predictor takes in the architecture of a sub-network and predicts its accuracy on the VWW dataset. Since it is an MLP network, the sub-network must be encoded into a vector. In this lab, we provide a class MCUNetArchEncoder to perform such conversion from sub-network architecture to a binary vector.

image_size_list = [96, 112, 128, 144, 160]
arch_encoder = MCUNetArchEncoder(
    image_size_list=image_size_list,
    base_depth=ofa_network.base_depth,
    depth_list=ofa_network.depth_list,
    expand_list=ofa_network.expand_ratio_list,
    width_mult_list=ofa_network.width_mult_list,
)

We generated an accuracy dataset beforehand, which is a collection of [architecture, accuracy] pairs stored under the acc_datasets folder.

With the architecture encoder, you are now required to define the accuracy predictor, which is a multi-layer perceptron (MLP) network with 400 channels per intermediate layer. For simplicity, we fix the number of layers to 3. Please implement this MLP network in the following cell.

class AccuracyPredictor(nn.Module):
    def __init__(
        self,
        arch_encoder,
        hidden_size=400,
        n_layers=3,
        checkpoint_path=None,
        device="cuda:0",
    ):
        super(AccuracyPredictor, self).__init__()
        self.arch_encoder = arch_encoder
        self.hidden_size = hidden_size
        self.n_layers = n_layers
        self.device = device

        layers = []
        
        ############### YOUR CODE STARTS HERE ###############
        # Let's build an MLP with n_layers layers. 
        # Each layer (nn.Linear) has hidden_size channels and 
        # uses nn.ReLU as the activation function.
        # Hint: You can assume that n_layers is fixed to be 3, for simplicity.
        # Hint: the input dimension of the first layer is not hidden_size.
        for i in range(self.n_layers):
            layers.append(
                nn.Sequential(
                    nn.Linear(
                        self.arch_encoder.n_dim if i == 0 else self.hidden_size,
                        self.hidden_size,
                    ),
                    nn.ReLU(inplace=True),
                )
            )
        ################ YOUR CODE ENDS HERE ################
        layers.append(nn.Linear(self.hidden_size, 1, bias=False))
        self.layers = nn.Sequential(*layers)
        self.base_acc = nn.Parameter(
            torch.zeros(1, device=self.device), requires_grad=False
        )

        if checkpoint_path is not None and os.path.exists(checkpoint_path):
            checkpoint = torch.load(checkpoint_path, map_location="cpu")
            if "state_dict" in checkpoint:
                checkpoint = checkpoint["state_dict"]
            self.load_state_dict(checkpoint)
            print("Loaded checkpoint from %s" % checkpoint_path)

        self.layers = self.layers.to(self.device)

    def forward(self, x):
        y = self.layers(x).squeeze()
        return y + self.base_acc

    def predict_acc(self, arch_dict_list):
        X = [self.arch_encoder.arch2feature(arch_dict) for arch_dict in arch_dict_list]
        X = torch.tensor(np.array(X)).float().to(self.device)
        return self.forward(X)

Let's print out the architecture of the AccuracyPredictor you just defined.

os.makedirs("pretrained", exist_ok=True)
acc_pred_checkpoint_path = (
    f"pretrained/{ofa_network.__class__.__name__}_acc_predictor.pth"
)
acc_predictor = AccuracyPredictor(
    arch_encoder,
    hidden_size=400,
    n_layers=3,
    checkpoint_path=None,
    device=device,
)
print(acc_predictor)
AccuracyPredictor(
  (layers): Sequential(
    (0): Sequential(
      (0): Linear(in_features=128, out_features=400, bias=True)
      (1): ReLU(inplace=True)
    )
    (1): Sequential(
      (0): Linear(in_features=400, out_features=400, bias=True)
      (1): ReLU(inplace=True)
    )
    (2): Sequential(
      (0): Linear(in_features=400, out_features=400, bias=True)
      (1): ReLU(inplace=True)
    )
    (3): Linear(in_features=400, out_features=1, bias=False)
  )
)

๋ฐ์ดํ„ฐ ์…‹์€ ์ด 4๋งŒ๊ฐœ์˜ ํ›ˆ๋ จ๋ฐ์ดํ„ฐ์™€ ๋งŒ๊ฐœ์˜ ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ๋กœ ์žˆ๊ณ , Accuracy๋Š” ๋ชจ๋ธ ํŒŒ๋ผ๋ฏธํ„ฐ(architecture)์™€ ์Œ์„ ์ด๋ฃฐ๊ฑฐ๋ผ๋Š”, ๋‚ด์šฉ์ž…๋‹ˆ๋‹ค. ํ•˜๋‚˜ ๋”, ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ one-hot representation ๋กœ ๋ฐ”๊พธ๋Š” ๊ณผ์ •๋„ ์žŠ์ง€๋งˆ์‹œ์ฃ ! ๋‹ค์Œ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์‹œ๋ฉด โ€œkernel size embedding: [0 1 0] => kernel size: 5; expand ratio embedding: [0 1 0] => expand ratio: 4โ€ ์ด๋Ÿฌ๋ฉด์„œ ๋ชจ๋ธ ๊ตฌ์กฐ๊ฐ€ ์ž„๋ฒ ๋”ฉ๋œ ๊ฑธ ํ™•์ธํ•˜์‹ค ์ˆ˜ ์žˆ์–ด์š”

Let's first visualize some samples in the accuracy dataset in the following cell.

The accuracy dataset is composed of 50,000 [architecture, accuracy] pairs, where 40,000 of them are used as the training set and the remaining 10,000 are used as the validation set.

For accuracy, we calculate the average accuracy of all [architecture, accuracy] pairs in the accuracy dataset and define it as base_acc. For the accuracy predictor, instead of directly regressing the accuracy of each architecture, its training target is accuracy - base_acc. Since accuracy - base_acc is usually much smaller than accuracy itself, this can make training easier.

For architecture, each subnet within the design space is uniquely represented by a binary vector. The binary vector is a concatenation of the one-hot representation for both global parameters (e.g. input resolution, width multiplier) and parameters of each inverted MobileNet block (e.g. kernel sizes and expand ratios). Note that we prefer one-hot representations over numerical representations because all design hyperparameters are discrete values.

For example, our design space supports

kernel_size = [3, 5, 7]
expand_ratio = [3, 4, 6]

Then, we represent kernel_size=3 as [1, 0, 0], kernel_size=5 as [0, 1, 0], and kernel_size=7 as [0, 0, 1]. Similarly, for expand_ratio=3, it is written as [1, 0, 0]; expand_ratio=4 is written as [0, 1, 0] and expand_ratio=6 is written as [0, 0, 1]. The representation for each inverted MobileNet block is obtained by concatenating the kernel size embedding with the expand ratio embedding. Note that for skipped blocks, we use [0, 0, 0] to represent their kernel sizes and expand ratios. You will see a detailed explanation of the architecture-embedding correspondence after running the following cell.

acc_dataset = AccuracyDataset("acc_datasets")
train_loader, valid_loader, base_acc = acc_dataset.build_acc_data_loader(
    arch_encoder=arch_encoder
)

print(f"The basic accuracy (mean accuracy of all subnets within the dataset is: {(base_acc * 100): .1f}%.")

# Let's print one sample in the training set
sampled = 0
for (data, label) in train_loader:
    data = data.to(device)
    label = label.to(device)
    print("=" * 100)
    # dummy pass to print the divided encoding
    arch_encoding = arch_encoder.feature2arch(data[0].int().cpu().numpy(), verbose=False)
    # print out the architecture encoding process in detail
    arch_encoding = arch_encoder.feature2arch(data[0].int().cpu().numpy(), verbose=True)
    visualize_subnet(arch_encoding)
    print(f"The accuracy of this subnet on the holdout validation set is: {(label[0] * 100): .1f}%.")
    sampled += 1
    if sampled == 1:
        break
Loading data: 100%|██████████| 50000/50000 [00:00<00:00, 228025.66it/s]
Train Size: 40000, Valid Size: 10000
The basic accuracy (mean accuracy of all subnets within the dataset) is:  90.3%.
====================================================================================================
network embedding: [1 0 0 0 0 | 0 1 0 | 0 1 0 | 0 1 0 | 1 0 0 | 0 0 1 | 1 0 0 | 1 0 0 | 0 0 1 | 1 0 0 | 0 1 0 | 0 1 0 | 0 0 1 | 0 0 1 | 0 0 0 | 0 0 0 | 0 1 0 | 0 0 1 | 0 1 0 | 0 0 1 | 0 1 0 | 0 1 0 | 0 1 0 | 0 0 1 | 1 0 0 | 1 0 0 | 0 1 0 | 0 1 0 | 0 0 1 | 0 0 1 | 0 1 0 | 0 0 1 | 0 0 1 | 1 0 0 | 0 1 0 | 0 0 1 | 0 0 0 | 0 0 0 | 0 0 0 | 0 0 0 | 0 1 0 | 0 0 1]
image resolution embedding: [1 0 0 0 0] => image resolution: 96
width multiplier embedding: [0 1 0] => width multiplier: 0.75
**************************************************Stage1**************************************************
kernel size embedding: [0 1 0] => kernel size: 5; expand ratio embedding: [0 1 0] => expand ratio: 4
kernel size embedding: [1 0 0] => kernel size: 3; expand ratio embedding: [0 0 1] => expand ratio: 6
kernel size embedding: [1 0 0] => kernel size: 3; expand ratio embedding: [1 0 0] => expand ratio: 3
**************************************************Stage2**************************************************
kernel size embedding: [0 0 1] => kernel size: 7; expand ratio embedding: [1 0 0] => expand ratio: 3
kernel size embedding: [0 1 0] => kernel size: 5; expand ratio embedding: [0 1 0] => expand ratio: 4
kernel size embedding: [0 0 1] => kernel size: 7; expand ratio embedding: [0 0 1] => expand ratio: 6
kernel size embedding: [0 0 0] expand ratio embedding: [0 0 0] => layer skipped.
**************************************************Stage3**************************************************
kernel size embedding: [0 1 0] => kernel size: 5; expand ratio embedding: [0 0 1] => expand ratio: 6
kernel size embedding: [0 1 0] => kernel size: 5; expand ratio embedding: [0 0 1] => expand ratio: 6
kernel size embedding: [0 1 0] => kernel size: 5; expand ratio embedding: [0 1 0] => expand ratio: 4
kernel size embedding: [0 1 0] => kernel size: 5; expand ratio embedding: [0 0 1] => expand ratio: 6
**************************************************Stage4**************************************************
kernel size embedding: [1 0 0] => kernel size: 3; expand ratio embedding: [1 0 0] => expand ratio: 3
kernel size embedding: [0 1 0] => kernel size: 5; expand ratio embedding: [0 1 0] => expand ratio: 4
kernel size embedding: [0 0 1] => kernel size: 7; expand ratio embedding: [0 0 1] => expand ratio: 6
kernel size embedding: [0 1 0] => kernel size: 5; expand ratio embedding: [0 0 1] => expand ratio: 6
**************************************************Stage5**************************************************
kernel size embedding: [0 0 1] => kernel size: 7; expand ratio embedding: [1 0 0] => expand ratio: 3
kernel size embedding: [0 1 0] => kernel size: 5; expand ratio embedding: [0 0 1] => expand ratio: 6
kernel size embedding: [0 0 0] expand ratio embedding: [0 0 0] => layer skipped.
kernel size embedding: [0 0 0] expand ratio embedding: [0 0 0] => layer skipped.
**************************************************Stage6**************************************************
kernel size embedding: [0 1 0] => kernel size: 5; expand ratio embedding: [0 0 1] => expand ratio: 6
The accuracy of this subnet on the holdout validation set is:  88.7%.

Question 4: Complete the code for accuracy predictor training.

Time to train!

criterion = torch.nn.L1Loss().to(device)
optimizer = torch.optim.Adam(acc_predictor.parameters())
# the default value is zero
acc_predictor.base_acc.data += base_acc
for epoch in tqdm(range(10)):
    acc_predictor.train()
    for (data, label) in tqdm(train_loader, desc="Epoch%d" % (epoch + 1), position=0, leave=True):
        # step 1. Move the data and labels to device (cuda:0).
        data = data.to(device)
        label = label.to(device)
        ############### YOUR CODE STARTS HERE ###############
        # step 2. Run forward pass.
        pred = acc_predictor(data)
        # step 3. Calculate the loss.
        loss = criterion(pred, label)
        # step 4. Perform the backward pass.
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        ################ YOUR CODE ENDS HERE ################

    acc_predictor.eval()
    with torch.no_grad():
        with tqdm(total=len(valid_loader), desc="Val", position=0, leave=True) as t:
            for (data, label) in valid_loader:
                # step 1. Move the data and labels to device (cuda:0).
                data = data.to(device)
                label = label.to(device)
                ############### YOUR CODE STARTS HERE ###############
                # step 2. Run forward pass.
                pred = acc_predictor(data)
                # step 3. Calculate the loss.
                loss = criterion(pred, label)
                ############### YOUR CODE ENDS HERE ###############
                t.set_postfix({"loss": loss.item()})
                t.update(1)

if not os.path.exists(acc_pred_checkpoint_path):
    torch.save(acc_predictor.cpu().state_dict(), acc_pred_checkpoint_path)
Epoch1: 100%|██████████| 157/157 [00:00<00:00, 362.86it/s]
Val: 100%|██████████| 40/40 [00:00<00:00, 109.00it/s, loss=0.00374]
Epoch2: 100%|██████████| 157/157 [00:00<00:00, 262.66it/s]
Val: 100%|██████████| 40/40 [00:00<00:00, 141.77it/s, loss=0.0026]
Epoch3: 100%|██████████| 157/157 [00:00<00:00, 241.87it/s]
Val: 100%|██████████| 40/40 [00:00<00:00, 118.13it/s, loss=0.00251]
Epoch4: 100%|██████████| 157/157 [00:00<00:00, 336.42it/s]
Val: 100%|██████████| 40/40 [00:00<00:00, 119.41it/s, loss=0.00259]
Epoch5: 100%|██████████| 157/157 [00:00<00:00, 331.75it/s]
Val: 100%|██████████| 40/40 [00:00<00:00, 117.39it/s, loss=0.00242]
Epoch6: 100%|██████████| 157/157 [00:00<00:00, 341.96it/s]
Val: 100%|██████████| 40/40 [00:00<00:00, 96.35it/s, loss=0.00235]
Epoch7: 100%|██████████| 157/157 [00:00<00:00, 321.68it/s]
Val: 100%|██████████| 40/40 [00:00<00:00, 122.19it/s, loss=0.0023]
Epoch8: 100%|██████████| 157/157 [00:00<00:00, 307.33it/s]
Val: 100%|██████████| 40/40 [00:00<00:00, 121.72it/s, loss=0.00178]
Epoch9: 100%|██████████| 157/157 [00:00<00:00, 329.76it/s]
Val: 100%|██████████| 40/40 [00:00<00:00, 119.59it/s, loss=0.00203]
Epoch10: 100%|██████████| 157/157 [00:00<00:00, 308.76it/s]
Val: 100%|██████████| 40/40 [00:00<00:00, 99.72it/s, loss=0.00195]
100%|██████████| 10/10 [00:08<00:00,  1.17it/s]

ํ›ˆ๋ จํ•œ ๋ชจ๋ธ์˜ Prediction๊ณผ ์‹ค์ œ ์ˆ˜์น˜์™€ Corrleation์ด ๊ทธ๋ž˜ํ”„๋กœ ๋ณด์ด๋„ค์š”. โ€œLinearโ€ ํ•ฉ๋‹ˆ๋‹ค.

predicted_accuracies = []
ground_truth_accuracies = []
acc_predictor = acc_predictor.to("cuda:0")
acc_predictor.eval()
with torch.no_grad():
    with tqdm(total=len(valid_loader), desc="Val") as t:
        for (data, label) in valid_loader:
            data = data.to(device)
            label = label.to(device)
            pred = acc_predictor(data)
            predicted_accuracies += pred.cpu().numpy().tolist()
            ground_truth_accuracies += label.cpu().numpy().tolist()
            if len(predicted_accuracies) > 200:
                break
plt.scatter(predicted_accuracies, ground_truth_accuracies)
# draw y = x
min_acc, max_acc = min(predicted_accuracies), max(predicted_accuracies)
print(min_acc, max_acc)
plt.plot([min_acc, max_acc], [min_acc, max_acc], c="red", linewidth=2)
plt.xlabel("Predicted accuracy")
plt.ylabel("Measured accuracy")
plt.title("Correlation between predicted accuracy and real accuracy")
Val:   0%|          | 0/40 [00:00<?, ?it/s]
0.8604847192764282 0.9356203079223633
Text(0.5, 1.0, 'Correlation between predicted accuracy and real accuracy')
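To quantify the "linear" relationship beyond the scatter plot, one can also compute the Pearson correlation over the collected points (a quick follow-up of my own, not part of the original lab):

# Pearson correlation between predicted and measured accuracy.
r = np.corrcoef(predicted_accuracies, ground_truth_accuracies)[0, 1]
print(f"Pearson r = {r:.3f}")  # close to 1.0 for a well-trained predictor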