  • Deep Neural Networks with PyTorch, COURSERA

    1.1 Tensors 1D

    a = torch.tensor([0,1,2,3,4])
    a.dtype -> torch.int64
    a.type() -> torch.LongTensor

    a = torch.tensor([0., 1., 2., 3., 4.], dtype=torch.int32)
    a.dtype -> torch.int32

    a = torch.FloatTensor([0, 1, 2, 3, 4]) -> tensor([0., 1., 2., 3., 4.])
    a = a.type(torch.FloatTensor)

    a.size() -> torch.Size([5])
    a.ndimension() -> 1

    a_col = a.view(5, 1) // same as a.view(-1, 1); .view(..) behaves like .reshape(..) (for contiguous tensors)

    numpy_array = np.array([0., 1., 2., 3., 4.])
    torch_tensor = torch.from_numpy(numpy_array) // .from_numpy(..) converts numpy-array to pytorch-tensor
    back_to_numpy = torch_tensor.numpy() // .numpy() converts pytorch-tensor to numpy-array

    torch_to_list = torch_tensor.tolist() -> [0., 1., 2., 3., 4.] // '.tolist()' converts pytorch-tensor to list

    torch_tensor[0] -> tensor(0.)
    torch_tensor[1] -> tensor(1.)

    torch_tensor[0].item() -> 0. // .item() returns a plain Python number (int or float)
    torch_tensor[1].item() -> 1.

    torch_tensor[0] = 100 // change the value in a tensor

    torch_tensor[1:4] // slicing works the same as with other arrays

    Basic Operations

    u = torch.tensor([1., 0.])
    v = torch.tensor([0., 1.])
    z = u+v -> tensor([1., 1.])

    y = torch.tensor([1, 2])
    z = 2 * y -> tensor([2, 4])

    u = torch.tensor([1, 2])
    v = torch.tensor([3, 2])
    z = u*v -> tensor([3, 4])

    u = torch.tensor([1, 2])
    v = torch.tensor([3, 1])
    z = torch.dot(u, v) -> tensor(5)

    u = torch.tensor([1, 2, 3, -1])
    z = u+1 -> tensor([2, 3, 4, 0]) // broadcasting

    Universal Functions

    a = torch.tensor([1., -1., 1., -1.])
    mean_a = a.mean() // .mean() requires a floating-point tensor; result: tensor(0.)

    b = torch.tensor([1, -2, 3, 4, 5])
    max_b = b.max() // .max()

    x = torch.tensor([0, np.pi/2, np.pi])
    y = torch.sin(x) -> tensor([sin(0), sin(pi/2), sin(pi)]) = tensor([0, 1, 0])

    torch.linspace(-2, 2, steps=5) -> tensor([-2., -1., 0., 1., 2.])

    1.2 Two-Dimensional Tensors

    a = [[11, 12, 13], [21, 22, 23], [31, 32, 33]]
    A = torch.tensor(a) -> 3x3 matrix

    A.ndimension() -> 2
    A.shape -> torch.Size([3, 3])
    A.numel() -> 9 // #elements

    Indexing, slicing, basic operations (e.g. add, subtract, multiply, divide) work the same way as with np.array

    A = torch.tensor([[0,1,1], [1,0,1]])
    B = torch.tensor([[1,1], [1,1], [-1,1]])
    C = torch.mm(A, B) -> tensor([[0, 2], [0, 2]]) // mm denotes matrix multiplication; A is 2x3, B is 3x2, so C is 2x2
    Note the difference between the dot product and matrix multiplication: the dot product is defined between two vectors, while the matrix product is defined between two matrices.

    1.3 Derivatives in PyTorch

    Let's see the following example: We're trying to compute a derivative of $y$ w.r.t $x$:

    $$y(x) = x^2$$

    If $x=2$, then

    $$ y(2) = 2^2 = 4 $$

    $$ \frac{dy(x)}{dx} = 2x $$

    $$ \frac{dy(2)}{dx} = 2(2) = 4 $$

    Let's see how we can compute them with pytorch:

    x = torch.tensor(2., requires_grad=True) // we want to evaluate the derivative of $y$ w.r.t $x$ at $x=2$; requires_grad=True lets PyTorch compute that derivative
    y = x**2 -> $2.^2$
    y.backward() // backward() calculates a derivative of $y$.
    x.grad // the derivative of $y$ evaluated at $x=2$, i.e. $\frac{dy(x=2)}{dx} = 4$ (note: .grad is an attribute, not a method)

    This is how $x$ and $y$ look internally.

    u = torch.tensor(1., requires_grad=True)
    v = torch.tensor(2., requires_grad=True)
    f = u*v + u**2 // $f(u, v) = uv + u^2$
    f.backward() // compute derivatives
    u.grad // partial derivative of $f$ w.r.t $u$ $= \frac{\partial f(u,v)}{\partial u} = v + 2u = 4$
    v.grad // partial derivative of $f$ w.r.t $v$ $= \frac{\partial f(u,v)}{\partial v} = u = 1$

    1.3 Simple Dataset

    Build a Data Set Class and Object

    To create our own dataset class, we subclass the abstract class Dataset, which we import with from torch.utils.data import Dataset.

    Toy dataset class

    __getitem__ allows an instance of the class to be indexed and sliced like a list or an array.
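    The class itself was shown as an image in the original post; a minimal sketch of such a toy dataset class (the name toy_set and the constant values are illustrative) could be:

    import torch
    from torch.utils.data import Dataset

    class toy_set(Dataset):
        "hypothetical toy dataset: x is a vector of 2s, y is a 1"
        def __init__(self, length=100, transform=None):
            self.len = length
            self.x = 2 * torch.ones(length, 2)
            self.y = torch.ones(length, 1)
            self.transform = transform

        def __getitem__(self, index):
            sample = self.x[index], self.y[index]
            if self.transform:
                sample = self.transform(sample)
            return sample

        def __len__(self):
            return self.len

    data_set = toy_set()
    data_set[0]   # -> (tensor([2., 2.]), tensor([1.]))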

    Transform

    Almost every dataset needs to be transformed (e.g. normalizing). We usually build a class to handle it. An example is shown below:

    __call__ allows an instance of the class to be called like a function.
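    The transform example was also an image; a sketch of one such transform class (the name add_mult and the constants are illustrative, chosen to match the Compose example below) might be:

    class add_mult(object):
        "add addx to x and multiply y by muly"
        def __init__(self, addx=1, muly=2):
            self.addx = addx
            self.muly = muly

        def __call__(self, sample):
            x, y = sample
            return x + self.addx, y * self.muly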

    Transforms Compose

    In many cases, we would like to run several transforms in series. We simply use from torchvision.transforms import Compose.

    Let's say you have two transformation classes:
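    With add_mult sketched above, a second hypothetical transform could be:

    class mult(object):
        "multiply both x and y by a constant"
        def __init__(self, mult=100):
            self.mult = mult

        def __call__(self, sample):
            x, y = sample
            return x * self.mult, y * self.mult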

    Then, we define:

    from torchvision import transforms
    
    data_transform = transforms.Compose([add_mult(), mult()])
    x_, y_ = data_transform(data_set[0])

    where data_transform takes an input and applies add_mult() first, and then mult(). After that, it outputs the transformed data.

    1.5 Dataset

    Dataset Class (for a large dataset)

    The process of building a dataset class for images is similar to the one above, but we usually do not load all the images into memory at once. The same technique can also be applied to large (numerical) datasets.

    Further information about taking out a mini-batch is in Section 6.3.1.

    Torch Vision Transforms

    Torch Vision has a set of widely used pre-built transforms for images. Here's an example:

    import torchvision.transforms as transforms
    
    transforms.CenterCrop(20)  # crop the image at its center to 20 x 20 (cutting away the edges)
    transforms.ToTensor()      # convert to a tensor
    
    croptensor_data_transform = transforms.Compose([ transforms.CenterCrop(20), transforms.ToTensor() ])
    
    dataset = Dataset(csv_file, data_dir, transforms=croptensor_data_transform)  # Dataset: our custom dataset class
    dataset[0][0].shape  # -> torch.Size([1, 20, 20])

    Torch Vision Datasets

    Torch Vision also has pre-built datasets that are commonly used to compare models.

    import torchvision.datasets as dsets
    
    dataset = dsets.MNIST(root="./data", download=True, transform=transforms.ToTensor())

    Linear Regression in 1D - Prediction

    Here, we model a linear regression as:

    $$ y = b + wx $$

    Assuming that we know $b=2$ and $w=-1$, the pytorch implementation is:

    b = torch.tensor(2., requires_grad=True)   # requires_grad only works with floating-point tensors
    w = torch.tensor(-1., requires_grad=True)
    
    def forward(x):
        "defines a prediction process"
        y = b + w*x
        return y
    
    x = torch.arange(0, 5).view(-1, 1)
    yhat = forward(x)  # -> [2., 1., 0., -1., -2.]^T (a 5x1 tensor)

    PyTorch also provides a built-in class for this linear computation: torch.nn.Linear.
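    For instance, a quick sketch with nn.Linear (the parameters are randomly initialized):

    import torch
    from torch import nn

    torch.manual_seed(1)
    model = nn.Linear(in_features=1, out_features=1)   # y = b + w*x with random w, b
    x = torch.tensor([[1.0], [2.0]])
    yhat = model(x)                                     # shape (2, 1)
    list(model.parameters())                            # [weight, bias]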

    Custom Modules

    Using model.state_dict(), you can access and modify the parameter values:
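    The snippet was an image in the original post; a sketch of a custom linear-regression module plus state_dict access (the class name LR matches the optimizer example in Section 3.3) might be:

    from torch import nn

    class LR(nn.Module):
        "custom module for linear regression"
        def __init__(self, input_size, output_size):
            super(LR, self).__init__()
            self.linear = nn.Linear(input_size, output_size)

        def forward(self, x):
            return self.linear(x)

    model = LR(1, 1)
    model.state_dict()                                    # OrderedDict with 'linear.weight' and 'linear.bias'
    model.state_dict()['linear.weight'].data.fill_(0.5)   # modify a parameter value in place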

    2.4 PyTorch Slope

    Linear Regression PyTorch

    3.1 Stochastic Gradient Descent and the Data Loader

    A procedure of the stochastic gradient descent is:

    1. For $epoch \in \{0,1, \cdots\}$
      1. For every sample
        1. Update the parameters

    The training is usually unstable because the variability of individual samples (batch size 1) is large.

    Stochastic Gradient Descent DataLoader

    Note that DataLoader wraps a Dataset instance to extend its functionality. If batch_size is larger than 1, it yields mini-batches.

    3.2 Mini-Batch Gradient Descent

    A procedure of the mini-batch gradient descent is:

    1. For $epoch \in \{0,1, \cdots\}$
      1. For every mini-batch
        1. Update the parameters

    If the mini-batch size is the same as the entire sample size, it is just a gradient descent (with a batch).

    In PyTorch, you can implement the mini-batch iteration with

    trainloader = DataLoader(dataset, batch_size=2**5)

    3.3 Optimization in PyTorch

    MSE loss function can be defined by using:

    criterion = nn.MSELoss()

    Various optimizers are defined in torch.optim:

    from torch import optim
    
    model = LR(1, 1)  # linear regression model
    optimizer = optim.SGD(model.parameters(), lr= 0.01)

    Similar to the model, the optimizer has a .state_dict().

    Using the DataLoader, nn.MSELoss, and optim.SGD, the training code can be written as:

    Overall flow of the training
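    The flow was shown as an image; a sketch of such a training loop under the definitions above (LR model, MSELoss criterion, SGD optimizer, and a dataset object with matching input/output sizes) might be:

    from torch import nn, optim
    from torch.utils.data import DataLoader

    model = LR(1, 1)
    criterion = nn.MSELoss()
    optimizer = optim.SGD(model.parameters(), lr=0.01)
    trainloader = DataLoader(dataset=dataset, batch_size=1)

    for epoch in range(100):
        for x, y in trainloader:
            yhat = model(x)               # forward pass
            loss = criterion(yhat, y)     # compute the loss
            optimizer.zero_grad()         # clear the old gradients
            loss.backward()               # compute the gradients
            optimizer.step()              # update the parameters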

    4.1 Multiple Linear Regression Prediction

    For $Y = b + w_1 X_1 + w_2 X_2$

    Just as a reminder, once we compute a loss function loss, $J$, and run loss.backward(), the partial derivatives of $J$ w.r.t the weights are computed:
    $$ \frac{\partial J(b, w_1, w_2)}{\partial b}, \quad \frac{\partial J(b, w_1, w_2)}{\partial w_1}, \quad \frac{\partial J(b, w_1, w_2)}{\partial w_2} $$
    and they can be used to update the weights (= optimizer.step()).
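    A small sketch of this two-feature case with nn.Linear (the input values are illustrative):

    import torch
    from torch import nn

    model = nn.Linear(in_features=2, out_features=1)    # y = b + w1*x1 + w2*x2
    X = torch.tensor([[1.0, 2.0], [3.0, 4.0]])           # two samples, two features
    yhat = model(X)                                       # shape (2, 1)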

    4.2 Multiple Output Linear Regression

    • #outputs = 2

    model = Linear(input_size=2, output_size=2)

    • #outputs = 3

    model = Linear(input_size=2, output_size=3)

    (With the built-in layer, the equivalents are nn.Linear(in_features=2, out_features=2) and nn.Linear(in_features=2, out_features=3).)

    The training of the model is basically the same as Section 3.3.

    5.0 Linear Classifier and Logistic Regression

    A logistic regression model can be defined as:
    $$ \hat{y} = \frac{1}{1 + \exp( - z )} $$
    $$ z = b + \sum_{i=1}^p w_i x_i $$
    where $p$ is the number of features/predictors. The logistic function is the same as the sigmoid function.

    import torch.nn as nn
    
    z = torch.arange(-10., 10., 0.1).view(-1, 1)
    
    # Approach #1
    sig = nn.Sigmoid()
    yhat = sig(z)
    
    # Approach #2
    yhat = torch.sigmoid(z)

    nn.Sequential

    nn.Sequential is a really fast way to build a model.

    Let's build a logistic regression model using nn.Sequential:

    model = nn.Sequential(nn.Linear(1,1), nn.Sigmoid())
    yhat = model(x)

    Internal operation of the model

    Custom Module

    Building a logistic regression model using the custom module
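    The code was an image in the original; a sketch of such a custom logistic-regression module might be:

    import torch
    from torch import nn

    class logistic_regression(nn.Module):
        def __init__(self, n_inputs):
            super(logistic_regression, self).__init__()
            self.linear = nn.Linear(n_inputs, 1)

        def forward(self, x):
            return torch.sigmoid(self.linear(x))

    model = logistic_regression(1)
    yhat = model(torch.tensor([[1.0]]))   # a value in (0, 1)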

    Binary Cross-Entropy Error - Loss Function

    For binary classification models, a binary cross-entropy loss function is commonly used. In PyTorch, you can use:

    criterion = nn.BCELoss()

    FYI, the multi-class cross-entropy loss (which expects raw logits rather than probabilities) is:

    criterion = nn.CrossEntropyLoss()

    6.2 Softmax Function

    Choosing the maximum output for each record using .max(dim=..):

    You would need this functionality to make a prediction for classification from the softmax-output.
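    A small sketch (z is an illustrative batch of outputs, one row per record):

    import torch

    z = torch.tensor([[0.1, 0.7, 0.2],
                      [0.8, 0.1, 0.1]])
    _, yhat = torch.max(z, dim=1)   # yhat -> tensor([1, 0]), the predicted class per row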

    The code for training a classification model is as follows:

    Only the training part is presented.
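    The code itself was an image; a sketch of such a training loop (model, train_loader, and epochs are assumed to be defined; nn.CrossEntropyLoss applies the softmax internally, so the model outputs raw logits) could be:

    import torch
    from torch import nn, optim

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=0.1)

    for epoch in range(epochs):
        for x, y in train_loader:
            z = model(x.view(-1, 28 * 28))   # flatten the images (e.g. MNIST)
            loss = criterion(z, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()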

    6.3 Softmax PyTorch

    Weight Visualization for MNIST

    Let's say, we are building an MNIST classification model:

    $$\boldsymbol{Y} = \boldsymbol{X} \boldsymbol{W}$$

    where $\boldsymbol{Y} \in \mathbb{R}^{1 \times 10}$, $\boldsymbol{X} \in \mathbb{R}^{1 \times 784}$, and $\boldsymbol{W} \in \mathbb{R}^{784 \times 10}$. Note that 784 is the number of pixels (28 x 28) and 10 is the number of $y$ labels. Let $i$ and $j$ be a row index and a column index, respectively. Then, each column $\boldsymbol{W}_j$ acts like a filter for the input. In fact, we can think of this as a multiple linear regression model for each output $j$:

    $$ \boldsymbol{Y}_j = \boldsymbol{X} \boldsymbol{W}_j $$

    Then, each $\boldsymbol{W}_j$ can be visualized after being reshaped to (28 x 28). The visualization of $\boldsymbol{W}_j$ is:

    Left: $\boldsymbol{W}_j$ before training, Right: $\boldsymbol{W}_j$ after training

    6.3.1 Dataset Class - Part2

    In Section 1.5, we built a custom dataset class using the abstract class Dataset from torch.utils.data. Its structure lets us load data one item at a time so that memory does not fill up. However, we have not yet seen how to take data in mini-batches. Here's how we can do it:

    Compared with the previous code in Section 1.5, a few things are added, such as shuffle_dataname and a Transform. Up to this point we can still only take data one by one; what makes mini-batch sampling possible is DataLoader from torch.utils.data:
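    The code was shown as images in the original post; a sketch of such an image dataset plus DataLoader (the CSV name, directory layout, and column order are illustrative) might be:

    import os
    import pandas as pd
    from PIL import Image
    from torch.utils.data import Dataset, DataLoader

    class ImageDataset(Dataset):
        "loads one image from disk per __getitem__ call, so the full set never sits in memory"
        def __init__(self, csv_file, data_dir, transform=None):
            self.data_dir = data_dir
            self.data = pd.read_csv(os.path.join(data_dir, csv_file))
            self.transform = transform

        def __len__(self):
            return len(self.data)

        def __getitem__(self, idx):
            image = Image.open(os.path.join(self.data_dir, self.data.iloc[idx, 1]))   # file-name column
            y = self.data.iloc[idx, 0]                                                # label column
            if self.transform:
                image = self.transform(image)
            return image, y

    dataset = ImageDataset(csv_file="index.csv", data_dir="./img_data/")
    trainloader = DataLoader(dataset=dataset, batch_size=32, shuffle=True)   # mini-batch sampling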

    The image data's directory structure looks like:

    Remember that this approach can be used not only for large image datasets but also for large numerical datasets.

    7.1 Neural Networks in One Dimension

    Here's how we can build a simple neural network with one hidden layer of size 2:

    Structure of the simple neural network
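    The code was an image; a sketch of such a one-hidden-layer network as a custom module could be:

    import torch
    from torch import nn

    class Net(nn.Module):
        def __init__(self, D_in, H, D_out):
            super(Net, self).__init__()
            self.linear1 = nn.Linear(D_in, H)
            self.linear2 = nn.Linear(H, D_out)

        def forward(self, x):
            x = torch.sigmoid(self.linear1(x))
            x = torch.sigmoid(self.linear2(x))
            return x

    model = Net(D_in=1, H=2, D_out=1)   # hidden layer of size 2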

    Using torch.nn.Sequential(..), we can build the neural network as well:

    model = nn.Sequential(nn.Linear(1, 2), nn.Sigmoid(), nn.Linear(2, 1), nn.Sigmoid())

    7.2 Neural Networks More Hidden Neurons

    The training process is presented below. Generally, it is the same as others above.

    The Adam optimizer is defined as:

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    7.5 BackPropagation

    Let's say we want to calculate $\frac{\partial a}{\partial x}$. However, we cannot calculate it directly because there is an intermediate variable $z$ between $a$ and $x$. Therefore, it expands to:

    $$ \frac{\partial a}{\partial x} = \frac{\partial a}{\partial z} \frac{\partial z}{\partial x} $$

    This is called the chain rule. The same principle applies to the backpropagation of neural networks.

    7.6 Activation Functions

    # Activation Functions
    torch.sigmoid(..)
    torch.tanh(..)
    torch.relu(..)
    
    a = torch.relu(z)  # example
    
    # Activation Layer
    sigmoid_layer = nn.Sigmoid()
    tanh_layer = nn.Tanh()
    relu_layer = nn.ReLU()
    
    a = relu_layer(z)  # example

    8.1 Deep Neural Networks

    With the custom module
    With the sequential module
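    The code was shown as images; a sketch of a small deep network with two hidden layers, both as a custom module and with nn.Sequential (the layer sizes are illustrative), might be:

    import torch
    from torch import nn

    # custom module version
    class DeepNet(nn.Module):
        def __init__(self, D_in, H1, H2, D_out):
            super(DeepNet, self).__init__()
            self.linear1 = nn.Linear(D_in, H1)
            self.linear2 = nn.Linear(H1, H2)
            self.linear3 = nn.Linear(H2, D_out)

        def forward(self, x):
            x = torch.sigmoid(self.linear1(x))
            x = torch.sigmoid(self.linear2(x))
            return self.linear3(x)

    # sequential version
    model = nn.Sequential(nn.Linear(2, 50), nn.Sigmoid(),
                          nn.Linear(50, 50), nn.Sigmoid(),
                          nn.Linear(50, 3))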

    8.1.2 Deeper Neural Networks

    The key to building a very deep neural network is nn.ModuleList. It behaves like a Python list but is designed to store (and register) any number of nn.Module's.

    An example is shown below:
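    The original example was an image; a sketch using nn.ModuleList driven by a list of layer sizes could be:

    import torch
    from torch import nn

    class DeeperNet(nn.Module):
        def __init__(self, layers):
            "layers: e.g. [input_size, h1, h2, ..., output_size]"
            super(DeeperNet, self).__init__()
            self.hidden = nn.ModuleList()
            for in_size, out_size in zip(layers, layers[1:]):
                self.hidden.append(nn.Linear(in_size, out_size))

        def forward(self, x):
            for i, layer in enumerate(self.hidden):
                x = layer(x)
                if i < len(self.hidden) - 1:   # activation on every layer except the last
                    x = torch.relu(x)
            return x

    model = DeeperNet([2, 50, 50, 3])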

    8.1 Dropout

    self.dropout = nn.Dropout(p)  # inside a custom module's __init__; p is the dropout probability

    One thing you need to keep in mind is the training mode versus the evaluation mode for dropout.

    • When you want to train the model, you set model.train().
    • When you want to evaluate (test) the model, you set model.eval()

    Pseudocode for training with dropout would look like this:

    model.train()  # dropout = True
    for epoch in range(epochs):
        ...
    
    with torch.no_grad():
        model.eval()  # dropout = False
        ...

    8.3 Neural Network Initialization Weights

    linear = nn.Linear(in_size, out_size)
    
    # Xavier uniform
    nn.init.xavier_uniform_(linear.weight)  # in-place; takes an optional gain= argument (no nonlinearity=)
    
    # He (Kaiming) uniform
    nn.init.kaiming_uniform_(linear.weight, nonlinearity="relu")  # default nonlinearity: leaky_relu

    8.5 Batch Normalization

    Similar to nn.Dropout(..), we need to switch between model.train() and model.eval() when using nn.BatchNorm1d(..). The training and evaluation would look like this:

    model.train()  # BN for training; the parameters based on a mini-batch are used.
    for epoch in range(epochs):
        ...
    
    with torch.no_grad():
        model.eval()  # BN for evaluation; the parameters based on the population are used.
        ...
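    A sketch of a model that uses nn.BatchNorm1d between linear layers (the sizes are illustrative):

    import torch
    from torch import nn

    class NetBatchNorm(nn.Module):
        def __init__(self, D_in, H, D_out):
            super(NetBatchNorm, self).__init__()
            self.linear1 = nn.Linear(D_in, H)
            self.bn1 = nn.BatchNorm1d(H)
            self.linear2 = nn.Linear(H, D_out)

        def forward(self, x):
            x = torch.sigmoid(self.bn1(self.linear1(x)))
            return self.linear2(x)

    model = NetBatchNorm(2, 10, 1)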

    9.1 Convolution

    In PyTorch, we can create a convolution layer as follows:

    conv = nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding)

    nn.Conv2d(..) expects the image size to be $(N, C_{in}, H, W)$ where $N$ denotes sample size, $C_{in}$ denotes a number of channels, and $H$ and $W$ denote height and width, respectively.

    Padding

    If we set padding=1, this will add one outer layer of zeros:

    Original image: (4x4), Padded image: (6x6)

    In PyTorch, there is no padding="same" option as in TensorFlow (recent versions added it, but only for stride 1), so we have to calculate the padding size ourselves. The output size of nn.Conv2d(..) is:

    $$ output\_size = floor \left( \frac{input\_size \; - filter\_size \; + \; (2 \: padding)}{stride} + 1 \right) \tag{1}$$

    Then, we can rewrite this formula w.r.t padding:

    $$ padding = ceil \left( \frac{(output\_size - 1) \times stride \; - \; input\_size \; + \; filter\_size}{2} \right) \tag{2}$$

    To achieve padding="same", we set output_size = input_size.

    For example, suppose input_size = 4, filter_size = 2, and stride = 2 (the annotations on the original figure corrected stride: 1 to stride: 2), and we want output_size = input_size = 4.

    By $Eq.(2)$, the padding size can be computed:

    $$ padding = ceil \left( \frac{(4 - 1) \times 2 - 4 + 2}{2} \right) = ceil(2) = 2 $$

    9.2 Activation Functions and Max Pooling

    Activation Map

    Input (image) -> kernel -> activation map
    PyTorch implementation for conv-activation
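    The implementation was an image in the original; a minimal conv-then-activation sketch (the sizes are illustrative) could be:

    import torch
    from torch import nn

    conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3)
    image = torch.zeros(1, 1, 5, 5)   # (N, C_in, H, W)
    z = conv(image)                   # linear convolution output
    a = torch.relu(z)                 # activation map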

    Max Pooling

    maxpool = nn.MaxPool2d(kernel_size, stride)   # avoid shadowing the built-in max
    x = maxpool(image)
    
    # we can also apply it directly as a function:
    x = torch.max_pool2d(image, kernel_size, stride)

    Properties of max pooling:
    • It adds non-linearity, in addition to the activation map.
    • It helps reduce the network size.
    • It may extract important features.

    Example of Max Pooling

    In terms of extracting important features, look at the following example:

    The two images differ by a small horizontal shift, but the outputs from max pooling are the same.

    9.3 Multiple Input and Output Channels

    Multiple output channels = Multiple kernels.

    In TensorFlow, you set filters=.. to specify the number of kernels (= the number of output channels). In PyTorch, you directly specify out_channels=..

    Multiple output channels = Multiple kernels

    The above example can be defined as:

    conv1 = nn.Conv2d(in_channels=1, out_channels=3, kernel_size=3)

    $\boldsymbol{W}_j$ represents a kernel, and $\boldsymbol{*}$ denotes the convolution operation.
    With an input image #1
    With an input image #2

    Shape of $\boldsymbol{W}$, $\boldsymbol{X}$, and $\boldsymbol{A}$

    where $H_{out}$ and $W_{out}$ depend on the padding (as well as the kernel size and stride). As for the output shape $(N, C_{out}, H_{out}, W_{out})$: if $N=1$, it is $(1, C_{out}, H_{out}, W_{out})$, i.e. $C_{out}$ feature maps produced by the $C_{out}$ kernels.

    Multiple Input Channels

    Multiple Input and Output Channels

    conv4 = nn.Conv2d(in_channels=2, out_channels=3, kernel_size=3)

    Convolution operation according to the above nn.Conv2d(..)
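    A quick shape check for this layer (the input size is illustrative):

    import torch
    from torch import nn

    conv4 = nn.Conv2d(in_channels=2, out_channels=3, kernel_size=3)
    x = torch.zeros(1, 2, 8, 8)   # (N, C_in, H, W)
    a = conv4(x)
    a.shape                       # -> torch.Size([1, 3, 6, 6]): 3 output channels; H and W shrink by kernel_size - 1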

    Visualizing the CNN's Operation

    Watch this Youtube video

    Visualizing Deep Layers

    People sometimes visualize the image patches that most strongly activate a certain kernel $j$, where $j \in \{ 1, \dots, C_{out} \}$, to show what kind of features that particular kernel captures. The image patch refers to the region of the input image that the kernel can see (its receptive field). In earlier layers, kernels see a small patch; in later layers, they see a large patch.

    FAQ regarding this:

    Q. I can't understand how these visualizations can be real if images are progressively being convoluted and have less and less resolution throughout the conv-net.
    A. You go backward to find the patch in the original image that is giving maximum response at a certain layer.

    Source: Andrew Ng's course

    9.4 Convolutional Neural Network

    Simple CNN

    Let's build a CNN with two convolutional layers and an output layer (a sketch follows the list below).

    1. The first convolutional layer has two output channels; $C_{in}=1, \; C_{out} = 2$.
    2. -> Activation map -> Pooling layer ->
    3. The second convolutional layer has two input channels and one output channel; $C_{in}=2, \; C_{out}=1$.
    4. -> Activation map -> Pooling layer -> Flatten ->
    5. Fully connected (FC) layers -> output
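    A sketch of such a CNN (the kernel sizes, padding, and 28 x 28 MNIST-style input are assumptions):

    import torch
    from torch import nn

    class CNN(nn.Module):
        def __init__(self):
            super(CNN, self).__init__()
            self.cnn1 = nn.Conv2d(in_channels=1, out_channels=2, kernel_size=5, padding=2)
            self.maxpool1 = nn.MaxPool2d(kernel_size=2)
            self.cnn2 = nn.Conv2d(in_channels=2, out_channels=1, kernel_size=5, padding=2)
            self.maxpool2 = nn.MaxPool2d(kernel_size=2)
            self.fc1 = nn.Linear(1 * 7 * 7, 10)   # 28 -> 14 -> 7 after the two 2x2 poolings

        def forward(self, x):
            x = self.maxpool1(torch.relu(self.cnn1(x)))   # conv -> activation map -> pooling
            x = self.maxpool2(torch.relu(self.cnn2(x)))
            x = x.view(x.size(0), -1)                     # flatten
            return self.fc1(x)

    model = CNN()
    out = model(torch.zeros(1, 1, 28, 28))   # -> shape (1, 10)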

    GPU in PyTorch

    We first need to check if CUDA is available:

    torch.cuda.is_available()

    We can check the current device and the number of available GPUs:

    torch.cuda.current_device()
    torch.cuda.device_count()

    We define our GPU device so that we can send our tensors to it:

    device = torch.device("cuda:0")
    
    or 
    
    device = torch.device(0)

    The following two methods can send tensors to GPU:

    torch.tensor([1,2,3,4]).to(device)
    
    or 
    
    torch.tensor([1,2,3,4], device="cuda:0")

    Creating the CNN on the GPU: almost everything remains the same except for the following two things: 1) sending the model to the GPU, and 2) sending the training data to the GPU.

    1) Sending a model to GPU

    model = CNN().to(device)  # sends the model (its parameters) to the GPU

    2) Sending training data to GPU

    For evaluation you don't need to calculate derivatives or run weight updates, but note that the test tensors still have to be on the same device as the model (or the model has to be moved back to the CPU).
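    A sketch of the training loop with the mini-batches moved to the GPU (train_loader, criterion, optimizer, and epochs are assumed to be defined as before):

    for epoch in range(epochs):
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)   # send the mini-batch to the GPU
            z = model(x)
            loss = criterion(z, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()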

    Note that if the model uses a stateful recurrent network, then the loss function should also be sent to the GPU.

    criterion = nn.MSELoss().to(device)

    9.5 Torch-Vision Models

    Torch-Vision models are popular architectures that have been developed and trained by experts; we can use these pre-trained models to better classify our own images.

    Example

    Let's use a pre-trained ResNet-18. One thing to note is that we have to check the expected input scaling for the model. In the case of ResNet-18, the image size has to be $(3 \times H \times W)$ with $H \ge 224$ and $W \ge 224$, and the images have to be loaded into the range $[0, 1]$ and then normalized using $mean = [0.485, 0.456, 0.406]$ and $std = [0.229, 0.224, 0.225]$.

    Let's import the ResNet18 and take a look at the layers:

    We'll freeze all the layers and overwrite the last layer with our custom FC layer:

    Note that when a layer is created, its parameters have .requires_grad set to True by default.
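    The code for these steps was shown as images; a sketch covering the import/inspection, the freezing, and the replacement of the last FC layer (the number of output classes, 7, is illustrative) might be:

    from torch import nn
    import torchvision.models as models

    model = models.resnet18(pretrained=True)
    print(model)                         # inspect the layers; the last one is model.fc

    for param in model.parameters():     # freeze all pre-trained parameters
        param.requires_grad = False

    model.fc = nn.Linear(512, 7)         # new FC layer; its parameters have requires_grad=True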

    When building an optimizer, we provide the parameters that need to be trained only:

    optimizer = torch.optim.Adam([param for param in model.parameters() if param.requires_grad], lr=1e-3)

    The overall training process is the same. One thing to note is to set model.train() and model.eval() appropriately, since ResNet-18 contains batch-normalization layers.



    How to Delete and Add a Layer in a (Pretrained) Model

    Let's say you have a pretrained ResNet18:

    import torchvision.models as models
    resnet = models.resnet18(pretrained=True)

    Note that the last layer is a fully connected (FC) layer (resnet.fc).

    We can create a generator for a list of the layers in resnet by using .children():
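    For example (layers is just a name chosen here):

    layers = list(resnet.children())   # .children() yields the top-level sub-modules
    len(layers)                        # -> 10 for ResNet-18; the last element is the FC layer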

    Then, we can create a new Sequential model with all the existing layers except the last one (this is how we can delete a layer from an existing model):
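    A sketch, reusing layers from above:

    from torch import nn

    new_model = nn.Sequential(*layers[:-1])   # every layer except the final FC layer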

    We can add a new FC layer as follows:
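    A sketch; the new output size (10) is illustrative, and nn.Flatten() is inserted because the average-pooling output of shape (N, 512, 1, 1) must be flattened before a Linear layer:

    new_model = nn.Sequential(*layers[:-1], nn.Flatten(), nn.Linear(512, 10))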

    If you want to save your memory, you can do all the steps above in one line:
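    A one-line version of the same idea:

    new_model = nn.Sequential(*list(models.resnet18(pretrained=True).children())[:-1], nn.Flatten(), nn.Linear(512, 10))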

     

     
