Deep Neural Networks with PyTorch, COURSERA
1.1 Tensors 1D
a = torch.tensor([0,1,2,3,4])
a.dtype
->torch.int64
a.type()
->torch.LongTensor
a = torch.tensor([0., 1., 2., 3., 4.], dtype=torch.int32)
a.dtype
->torch.int32
a = torch.FloatTensor([0, 1, 2, 3, 4])
->tensor([0., 1., 2., 3., 4.])
a = a.type(torch.FloatTensor)
a.size()
->torch.Size([5])
a.ndimension()
-> 1
a_col = a.view(5, 1) // same as a.view(-1, 1)
// .view(..) is the same as .reshape(..)
numpy_array = np.array([0., 1., 2., 3., 4.])
torch_tensor = torch.from_numpy(numpy_array)
// .from_numpy(..) converts a numpy array to a pytorch tensor
back_to_numpy = torch_tensor.numpy()
// .numpy() converts a pytorch tensor to a numpy array
torch_to_list = torch_tensor.tolist()
-> [0., 1., 2., 3., 4.] // .tolist() converts a pytorch tensor to a list
torch_tensor[0]
->tensor(0.)
torch_tensor[1]
->tensor(1.)
torch_tensor[0].item()
-> 0. // .item() returns a plain Python (int/float) value
torch_tensor[1].item()
-> 1.
torch_tensor[0] = 100 // change a value in the tensor
torch_tensor[1:4] // slicing works the same way as with other arrays
Basic Operations
u = torch.tensor([1., 0.])
v = torch.tensor([0., 1.])
z = u+v
->tensor([1., 1.])
y = torch.tensor([1, 2])
z = 2 * y
->tensor([2, 4])
u = torch.tensor([1, 2])
v = torch.tensor([3, 2])
z = u*v
->tensor([3, 4])
u = torch.tensor([1, 2])
v = torch.tensor([3, 1])
z = torch.dot(u, v)
-> 5
u = torch.tensor([1, 2, 3, -1])
z = u+1
->tensor([2, 3, 4, 0])
// broadcasting
Universal Functions
a = torch.tensor([1., -1., 1., -1.])
mean_a = a.mean()
// .mean() computes the average (note: it requires a floating-point tensor)
b = torch.tensor([1, -2, 3, 4, 5])
max_b = b.max()
// .max() returns the largest element
x = torch.tensor([0, np.pi/2, np.pi])
y = torch.sin(x)
->tensor([sin(0), sin(pi/2), sin(pi)])
=tensor([0, 1, 0])
torch.linspace(-2, 2, steps=5)
->tensor([-2., -1., 0., 1., 2.])
1.2 Two-Dimensional Tensors
a = [[11, 12, 13], [21, 22, 23], [31, 32, 33]]
A = torch.tensor(a)
-> 3x3 matrix
A.ndimension()
-> 2
A.shape
->torch.Size([3, 3])
A.numel()
-> 9 // number of elements
Indexing, slicing, and basic operations (e.g. add, subtract, multiply, divide) work the same way as with np.array.
A = torch.tensor([[0,1,1], [1,0,1]])
B = torch.tensor([[1,1], [1,1], [-1,1]])
C = torch.mm(A,B)
// mm denotes matrix multiplication.
Note the difference between the dot product and the matrix multiplication: the dot product is defined between two vectors, while the matrix product is defined between two matrices.
1.3 Derivatives in PyTorch
Let's see the following example: We're trying to compute a derivative of $y$ w.r.t $x$:
$$y(x) = x^2$$
If $x=2$, then
$$ y(2) = 2^2 = 4 $$
$$ \frac{dy(x)}{dx} = 2x $$
$$ \frac{dy(2)}{dx} = 2(2) = 4 $$
Let's see how we can compute them with pytorch:
x = torch.tensor(2., requires_grad=True)
// we want to evaluate the derivative of a function $y$ w.r.t $x$ at $x=2$; requires_grad=True allows the derivative to be computed.
y = x**2
-> $2.^2 = 4.$
y.backward()
// backward() calculates the derivative of $y$.
x.grad
// evaluates the derivative of $y$ at $x=2$, i.e. $\frac{dy(2)}{dx} = 4$. (Note: grad is an attribute, not a method.)
u = torch.tensor(1., requires_grad=True)
v = torch.tensor(2., requires_grad=True)
f = u*v + u**2
// $f(u, v) = uv + u^2$
f.backward()
// compute the derivatives
u.grad
// partial derivative of $f$ w.r.t $u$ $= \frac{\partial f(u,v)}{\partial u} = v + 2u = 4$
v.grad
// partial derivative of $f$ w.r.t $v$ $= \frac{\partial f(u,v)}{\partial v} = u = 1$
1.4 Simple Dataset
Build a Data Set Class and Object
To create our own dataset class, we must import the abstract class Dataset:
from torch.utils.data import Dataset
__getitem__ allows an instance of the class to be indexed and sliced like a list or an array (refer to here).
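The course's exact example isn't reproduced in these notes; a minimal sketch of such a class (toy data, illustrative names) could look like:
import torch
from torch.utils.data import Dataset

class ToyDataset(Dataset):
    def __init__(self, length=100, transform=None):
        self.x = 2 * torch.ones(length, 2)   # toy features
        self.y = torch.ones(length, 1)       # toy targets
        self.len = length
        self.transform = transform

    def __getitem__(self, index):
        sample = self.x[index], self.y[index]
        if self.transform:
            sample = self.transform(sample)  # apply the transform if one was given
        return sample

    def __len__(self):
        return self.len

data_set = ToyDataset()
x0, y0 = data_set[0]   # indexing works because of __getitem__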
Transform
Almost every dataset needs to be transformed (e.g. normalizing). We usually build a class to handle it. An example is shown below:
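The course's transform code isn't in these notes; assuming a sample is an (x, y) pair, a minimal sketch of such a class (named add_mult, since that name appears in the Compose example below) might be:
class add_mult(object):
    def __init__(self, addx=1, muly=2):
        self.addx = addx
        self.muly = muly

    def __call__(self, sample):
        x, y = sample
        x = x + self.addx   # add a constant to x
        y = y * self.muly   # multiply y by a constant
        return x, y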
__call__ allows an instance of the class to be callable (refer to here).
Transforms Compose
In many cases, we would like to run several transforms in series. For that, we simply use:
from torchvision.transforms import Compose
Let's say you have two transformation classes:
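Their definitions aren't in these notes; together with the add_mult sketch above, a second illustrative transform (matching the mult() used below, behavior assumed) could be:
class mult(object):
    def __init__(self, mult=100):
        self.mult = mult

    def __call__(self, sample):
        x, y = sample
        return x * self.mult, y * self.mult   # scale both x and y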
Then, we define:
from torchvision import transforms
data_transform = transforms.Compose([add_mult(), mult()])
x_, y_ = data_transform(data_set[0])
where data_transform takes an input, applies add_mult() first and then mult(), and outputs the transformed data.
1.5 Dataset
Dataset Class (for a large dataset)
The process of building a dataset of images is similar to the one above, but we usually do not load all the images into memory; each image is read only when it is requested. The same technique also applies to a large (numerical) dataset.
Further information about taking out a mini-batch is in Section 6.3.1.
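The course's class isn't reproduced here; a minimal sketch that loads each image lazily from disk (csv_file and data_dir as in the example below; the CSV column layout is an assumption) could look like:
import os
import pandas as pd
from PIL import Image
from torch.utils.data import Dataset

class ImageDataset(Dataset):
    def __init__(self, csv_file, data_dir, transform=None):
        self.data_dir = data_dir
        self.data = pd.read_csv(os.path.join(data_dir, csv_file))
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # the image file is opened only when this sample is requested
        image = Image.open(os.path.join(self.data_dir, self.data.iloc[idx, 1]))
        y = self.data.iloc[idx, 0]
        if self.transform:
            image = self.transform(image)
        return image, y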
Torch Vision Transforms
Torch Vision has a set of widely used pre-built transforms for images. Here's an example:
import torchvision.transforms as transforms
transforms.CenterCrop(20) // crop the edges of an image
transforms.ToTensor() // convert to a tensor
croptensor_data_transform = transforms.Compose([transforms.CenterCrop(20), transforms.ToTensor()])
dataset = Dataset(csv_file, data_dir, transforms=croptensor_data_transform)
dataset[0][0].shape // -> torch.Size([1, 20, 20])
Torch Vision Datasets
Torch Vision also has pre-built datasets that are commonly used to compare models.
import torchvision.datasets as dsets
dataset = dsets.MNIST(root="./data", download=True, transform=transforms.ToTensor())
Linear Regression in 1D - Prediction
Here, we model a linear regression as:
$$ y = b + wx $$
Assuming that we know $b=2$ and $w=-1$, the pytorch implementation is:
b = torch.tensor(2., requires_grad=True)   # requires_grad needs a floating-point tensor
w = torch.tensor(-1., requires_grad=True)
def forward(x):
    "defines a prediction process"
    y = b + w*x
    return y
x = torch.arange(0, 5).view(-1, 1)
yhat = forward(x) // -> [2, 1, 0, -1, -2]^T
PyTorch also provides built-in classes for linear computation, such as torch.nn.Linear.
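For instance, a prediction with the built-in Linear class (its weight and bias are initialized randomly) looks like this:
import torch
from torch.nn import Linear

model = Linear(in_features=1, out_features=1)
x = torch.tensor([[1.0], [2.0]])
yhat = model(x)                    # predictions for two samples
print(list(model.parameters()))    # the weight and the bias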
Custom Modules
Using model.state_dict(), you can access and modify the parameter values:
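The course's custom module isn't reproduced in these notes; a minimal sketch (the class name LR matches its later use in Section 3.3) could be:
import torch
import torch.nn as nn

class LR(nn.Module):
    def __init__(self, input_size, output_size):
        super(LR, self).__init__()
        self.linear = nn.Linear(input_size, output_size)

    def forward(self, x):
        return self.linear(x)

model = LR(1, 1)
print(model.state_dict())   # OrderedDict with 'linear.weight' and 'linear.bias'
model.state_dict()['linear.weight'].data[0] = torch.tensor([0.3])   # modify a parameter in place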
2.4 PyTorch Slope
Linear Regression PyTorch
3.1 Stochastic Gradient Descent and the Data Loader
A procedure of the stochastic gradient descent is:
- For $epoch \in \{0, 1, \cdots\}$:
  - For every sample:
    - Update the parameters
The training is usually unstable because the variability of a single sample (sample size 1) is too large.
Stochastic Gradient Descent DataLoader
Note that DataLoader wraps the instance of Data (the dataset class) to extend its functionality. If batch_size is larger than 1, it outputs a mini-batch.
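A minimal sketch (names are illustrative; the manual update assumes the w, b, and forward from Section 2, and Data is the custom dataset class mentioned above):
import torch
from torch.utils.data import DataLoader

dataset = Data()                                          # custom Dataset instance
trainloader = DataLoader(dataset=dataset, batch_size=1)   # batch_size=1 -> one sample at a time

lr = 0.1
for epoch in range(4):
    for x, y in trainloader:
        yhat = forward(x)
        loss = torch.mean((yhat - y) ** 2)        # MSE for this single sample
        loss.backward()
        w.data = w.data - lr * w.grad.data        # manual parameter update
        b.data = b.data - lr * b.grad.data
        w.grad.data.zero_()
        b.grad.data.zero_()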
3.2 Mini-Batch Gradient Descent
A procedure of the mini-batch gradient descent is:
- For $epoch \in \{0, 1, \cdots\}$:
  - For every mini-batch:
    - Update the parameters
If the mini-batch size is the same as the entire sample size, this is just (batch) gradient descent.
In PyTorch, you can implement the mini-batch iteration with
trainloader = DataLoader(dataset, batch_size=2**5)
3.3 Optimization in PyTorch
MSE loss function can be defined by using:
criterion = nn.MSELoss()
Various optimizers are defined in torch.optim. For example:
from torch import optim
model = LR(1, 1)  # linear regression model
optimizer = optim.SGD(model.parameters(), lr=0.01)
Similar to the model, the optimizer has .state_dict().
Using the DataLoader, nn.MSELoss, and optim.SGD, the training code can be written as:
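A minimal sketch of that training code (dataset and epochs are assumed names; LR is the custom module from Section 2):
from torch import nn, optim
from torch.utils.data import DataLoader

model = LR(1, 1)
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)
trainloader = DataLoader(dataset=dataset, batch_size=1)

for epoch in range(epochs):
    for x, y in trainloader:
        yhat = model(x)
        loss = criterion(yhat, y)
        optimizer.zero_grad()   # clear the gradients from the previous step
        loss.backward()         # compute the gradients
        optimizer.step()        # update the weights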
4.1 Multiple Linear Regression Prediction
Just as a reminder, once we compute a loss function loss, $J$, and run loss.backward(), the partial derivatives of $J$ w.r.t. the weights are computed:
$$ \frac{\partial J(b, w_1, w_2)}{\partial b}, \quad \frac{\partial J(b, w_1, w_2)}{\partial w_1}, \quad \frac{\partial J(b, w_1, w_2)}{\partial w_2} $$
and they can be used to update the weights (= optimizer.step()).
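For example, a prediction with two input features (a sketch with illustrative values; nn.Linear initializes the weights randomly):
import torch
import torch.nn as nn

model = nn.Linear(in_features=2, out_features=1)
x = torch.tensor([[1.0, 2.0]])                           # one sample with 2 features
yhat = model(x)                                          # shape: (1, 1)
X = torch.tensor([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])   # three samples
yhat = model(X)                                          # shape: (3, 1)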
4.2 Multiple Output Linear Regression
- #output = 2
model = Linear(input_size=2, output_size=2)
- #output = 3
model = Linear(input_size=2, output_size=3)
The training of the model is basically the same as Section 3.3.
5.0 Linear Classifier and Logistic Regression
A logistic regression model can be defined as:
$$ \hat{y} = \frac{1}{1 + \exp( - z )} $$
$$ z = b + \sum_{i=1}^p w_i x_i $$
where $p$ is the number of features/predictors. The logistic function is basically the same as the sigmoid function.
import torch.nn as nn
z = torch.arange(-10., 10., 0.1).view(-1, 1)
# Approach #1
sig = nn.Sigmoid()
yhat = sig(z)
# Approach #2
yhat = torch.sigmoid(z)
nn.Sequential
nn.Sequential is a really fast way to build a model. Let's build a logistic regression model using nn.Sequential:
model = nn.Sequential(nn.Linear(1, 1), nn.Sigmoid())
yhat = model(x)
Custom Module
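The course's custom module isn't in these notes; a minimal sketch of a logistic regression module (illustrative names) could be:
import torch
import torch.nn as nn

class LogisticRegression(nn.Module):
    def __init__(self, n_inputs):
        super(LogisticRegression, self).__init__()
        self.linear = nn.Linear(n_inputs, 1)

    def forward(self, x):
        # z = b + w^T x, squashed by the sigmoid
        return torch.sigmoid(self.linear(x))

model = LogisticRegression(n_inputs=1)
yhat = model(torch.tensor([[1.0]]))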
Binary Cross-Entropy Error - Loss Function
For binary classification models, a binary cross-entropy loss function is commonly used. In PyTorch, you can use:
criterion = nn.BCELoss()
FYI, the cross-entropy loss is defined as:
criterion = nn.CrossEntropyLoss()
6.2 Softmax Function
Choosing the maximum output for each record using .max(dim=..):
You would need this functionality to make a prediction for classification from the softmax output.
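For instance, assuming z holds the model outputs for a batch (shape: batch size x number of classes):
_, yhat = torch.max(z.data, dim=1)   # yhat is the index of the largest output (the predicted class) per record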
Code for training a classification model is as follows:
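The code itself isn't in these notes; a minimal sketch of such a training loop (input_dim, n_classes, dataset, and epochs are assumed names) could be:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

model = nn.Linear(input_dim, n_classes)     # the softmax is folded into the loss below
criterion = nn.CrossEntropyLoss()           # applies log-softmax + negative log-likelihood internally
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
trainloader = DataLoader(dataset=dataset, batch_size=100)

for epoch in range(epochs):
    for x, y in trainloader:
        optimizer.zero_grad()
        z = model(x.view(-1, input_dim))    # flatten each input into a vector
        loss = criterion(z, y)
        loss.backward()
        optimizer.step()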
6.3 Softmax PyTorch
Weight Visualization for MNIST
Let's say, we are building an MNIST classification model:
$$\boldsymbol{Y} = \boldsymbol{X} \boldsymbol{W}$$
where $\boldsymbol{Y} \in \mathbb{R}^{1 \times 10}$, $\boldsymbol{X} \in \mathbb{R}^{1 \times 784}$, and $\boldsymbol{W} \in \mathbb{R}^{784 \times 10}$. Note that 784 is the number of pixels (28 x 28) and 10 is the number of $y$ labels. Let $i$ and $j$ be a row index and a column index, respectively. Then, each column $\boldsymbol{W}_j$ acts like a filter for the input. In fact, we can think of this as a set of multiple linear regression models:
$$ \boldsymbol{Y}_j = \boldsymbol{X} \boldsymbol{W}_j $$
Then, each $\boldsymbol{W}_j$ can be visualized after being reshaped to (28 x 28). The visualization of $\boldsymbol{W}_j$ is:
6.3.1 Dataset Class - Part2
In Section 1.5, we built a custom dataset class using the abstract class Dataset from torch.utils.data. Its structure allowed us to take data one by one so that our memory won't be full. However, we did not learn how to take data by mini-batch. Here's how we can do it.
From the previous code in Section 1.5, some things are added, such as shuffle_dataname and Transform.
Up to here, we can only take data one by one. What makes mini-batch sampling possible is DataLoader from torch.utils.data:
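A minimal sketch (names are illustrative):
from torch.utils.data import DataLoader

trainloader = DataLoader(dataset=dataset, batch_size=100, shuffle=True)
for x, y in trainloader:
    print(x.shape)   # e.g. torch.Size([100, ...]) -- one mini-batch of 100 samples
    break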
The image data's directory structure looks like:
Remember that this approach can be used not only for large image datasets but also for large numerical datasets.
7.1 Neural Networks in One Dimension
Here's how we can build a simple neural network with one hidden layer of size 2:
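The course's class isn't reproduced in these notes; a minimal sketch (illustrative names) is:
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self, D_in, H, D_out):
        super(Net, self).__init__()
        self.linear1 = nn.Linear(D_in, H)    # input -> hidden
        self.linear2 = nn.Linear(H, D_out)   # hidden -> output

    def forward(self, x):
        x = torch.sigmoid(self.linear1(x))
        x = torch.sigmoid(self.linear2(x))
        return x

model = Net(1, 2, 1)   # hidden layer of size 2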
Using torch.nn.Sequential(..), we can build the same neural network as well:
model = nn.Sequential(nn.Linear(1, 2), nn.Sigmoid(), nn.Linear(2, 1), nn.Sigmoid())
7.2 Neural Networks More Hidden Neurons
The training process is generally the same as in the previous sections.
The Adam optimizer is defined as:
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
7.5 BackPropagation
Let's say we want to calculate $\frac{\partial a}{\partial x}$. However, we cannot calculate it directly because there is an intermediate variable $z$ between $a$ and $x$. Therefore, it expands to:
$$ \frac{\partial a}{\partial x} = \frac{\partial a}{\partial z} \frac{\partial z}{\partial x} $$
This is called the chain rule. The same principle applies to the backpropagation of neural networks.
7.6 Activation Functions
# Activation functions
torch.sigmoid(..)
torch.tanh(..)
torch.relu(..)
a = torch.relu(z)  # example
# Activation layers
sigmoid_layer = nn.Sigmoid()
tanh_layer = nn.Tanh()
relu_layer = nn.ReLU()
a = relu_layer(z)  # example
8.1 Deep Neural Networks
8.1.2 Deeper Neural Networks
The main key to building a very deep neural network is nn.ModuleList. It is just like a Python list, designed to store any desired number of nn.Module's. An example is shown below:
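The course's example isn't reproduced here; a minimal sketch (the Layers argument, e.g. [2, 50, 50, 3], is an assumed convention) might be:
import torch
import torch.nn as nn

class DeepNet(nn.Module):
    def __init__(self, Layers):
        super(DeepNet, self).__init__()
        self.hidden = nn.ModuleList()
        for input_size, output_size in zip(Layers, Layers[1:]):
            self.hidden.append(nn.Linear(input_size, output_size))

    def forward(self, x):
        for i, layer in enumerate(self.hidden):
            if i < len(self.hidden) - 1:
                x = torch.relu(layer(x))   # hidden layers
            else:
                x = layer(x)               # output layer (no activation here)
        return x

model = DeepNet(Layers=[2, 50, 50, 3])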
8.2 Dropout
self.dropout = nn.Dropout(p)
One thing you need to keep in mind is the training mode and the evaluation mode for the dropout.
- When you want to train the model, you set model.train().
- When you want to evaluate (test) the model, you set model.eval().
Pseudocode for training with dropout would look like this:
model.train()   # dropout = True
for epoch in range(epochs):
    ...
with torch.no_grad():
    model.eval()   # dropout = False
    ...
8.3 Neural Network Initialization Weights
linear = nn.Linear(in_size, out_size)
# Xavier uniform (an in-place function)
nn.init.xavier_uniform_(linear.weight)
# He (Kaiming) uniform; the default nonlinearity is 'leaky_relu'
nn.init.kaiming_uniform_(linear.weight, nonlinearity="relu")
8.5 Batch Normalization
Similar to nn.Dropout(..), we need to switch between model.train() and model.eval() for nn.BatchNorm1d(..). The training and evaluation would look like this:
model.train()   # BN for training; the statistics of the current mini-batch are used.
for epoch in range(epochs):
    ...
with torch.no_grad():
    model.eval()   # BN for evaluation; the running (population) estimates are used.
    ...
9.1 Convolution
In PyTorch, we can create a convolution layer as follows:
conv = nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding)
nn.Conv2d(..) expects the input to be of size $(N, C_{in}, H, W)$, where $N$ denotes the sample size, $C_{in}$ the number of channels, and $H$ and $W$ the height and width, respectively.
Padding
If we set padding=1, this will add one outer layer of zeros.
In PyTorch, there is no feature like padding="same" as in Tensorflow. Therefore, we have to calculate the padding size ourselves. The output size of nn.Conv2d(..)
is:$$ output\_size = floor \left( \frac{input\_size \; - filter\_size \; + \; (2 \: padding)}{stride} + 1 \right) \tag{1}$$
Then, we can rewrite this formula w.r.t padding:
$$ padding = floor \left( \frac{output\_size \times stride \; - \; input\_size \; + \; filter\_size \; - \; 1}{2} \right) \tag{2}$$
To achieve padding="same", we set output_size = input_size.
For example, suppose input_size = 4, filter_size = 2, and stride = 2, and we want output_size = input_size = 4.
By $Eq.(2)$, the padding size can be computed:
$$ padding = \frac{4 \times 2 - 4 + 2 - 1}{2} = 2.5 \rightarrow floor(2.5) = 2 $$
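A quick sanity check of this example (a sketch with the values assumed above):
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=2, stride=2, padding=2)
image = torch.zeros(1, 1, 4, 4)
print(conv(image).shape)   # -> torch.Size([1, 1, 4, 4]): the spatial size matches the input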
9.2 Activation Functions and Max Pooling
Activation Map
Max Pooling
maxpool = nn.MaxPool2d(kernel_size, stride)
x = maxpool(image)
# we can also apply it directly as a function:
x = torch.max_pool2d(image, kernel_size, stride)
- It adds non-linearity in addition to the activation map.
- It helps with reducing the network size
- It may extract important features
Example of Max Pooling
In terms of extracting important features, look at the following example:
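The original figure isn't in these notes; a small numerical sketch (toy values) shows that 2x2 max pooling keeps the largest value in each window:
import torch
import torch.nn as nn

image = torch.tensor([[1., 2., 3., -4.],
                      [0., 2., -3., 0.],
                      [4., 0., 1., 5.],
                      [0., 1., 0., 2.]]).view(1, 1, 4, 4)
maxpool = nn.MaxPool2d(kernel_size=2, stride=2)
print(maxpool(image))   # -> [[2., 3.], [4., 5.]] (the maximum of each 2x2 window)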
9.3 Multiple Input and Output Channels
Multiple output channels = Multiple kernels.
In Tensorflow, you set filters=.. to specify the number of kernels (= the number of output channels). In PyTorch, you directly set out_channels=...
The above example can be defined as:
conv1 = nn.Conv2d(in_channels=1, out_channels=3, kernel_size=3)
Shape of $\boldsymbol{W}$, $\boldsymbol{X}$, and $\boldsymbol{A}$
$\boldsymbol{W}: (C_{out}, C_{in}, K_H, K_W)$, $\quad \boldsymbol{X}: (N, C_{in}, H, W)$, $\quad \boldsymbol{A}: (N, C_{out}, H_{out}, W_{out})$,
where $H_{out}$ and $W_{out}$ differ depending on the padding. As for $(N, C_{out}, H_{out}, W_{out})$, if $N=1$, it is $(1, C_{out}, H_{out}, W_{out})$, which is $n (= C_{out})$ images filtered by $n$ kernels.
Multiple Input Channels
Multiple Input and Output Channels
conv4 = nn.Conv2d(in_channels=2, out_channels=3, kernel_size=3)
Visualizing the CNN's Operation
Watch this Youtube video
Visualizing Deep Layers
People sometimes visualize the image patches that most highly activate a certain kernel $j$, where $j \in \{1, \ldots, C_{out}\}$, to present what kind of features that particular kernel captures. An image patch refers to the region of the image that a kernel can see. In the earlier layers, kernels see a small patch, and in the later layers, kernels see a large patch.
FAQ regarding this:
Q. I can't understand how these visualizations can be real if images are progressively being convoluted and have less and less resolution throughout the conv-net.
A. You go backward to find the patch in the original image that is giving the maximum response at a certain layer. (Source: Andrew Ng's course)
9.4 Convolutional Neural Network
Simple CNN
Let's build a CNN with two convolutional layers and an output layer (a sketch follows the list below).
- The first convolutional layer has two output channels; $C_{in}=1, \; C_{out} = 2$.
- -> Activation map -> Pooling layer ->
- The second convolutional layer has two input channels and one output channel; $C_{in}=2, \; C_{out}=1$.
- -> Activation map -> Pooling layer -> Flatten ->
- Fully connected (FC) layers -> output
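The course's implementation isn't reproduced here; a minimal sketch under assumptions (kernel size 5, padding 2, and a 16x16 input image, which fixes the FC input size) could be:
import torch
import torch.nn as nn

class CNN(nn.Module):
    def __init__(self, out_1=2, out_2=1, n_classes=10):
        super(CNN, self).__init__()
        self.cnn1 = nn.Conv2d(in_channels=1, out_channels=out_1, kernel_size=5, padding=2)
        self.maxpool1 = nn.MaxPool2d(kernel_size=2)
        self.cnn2 = nn.Conv2d(in_channels=out_1, out_channels=out_2, kernel_size=5, padding=2)
        self.maxpool2 = nn.MaxPool2d(kernel_size=2)
        self.fc1 = nn.Linear(out_2 * 4 * 4, n_classes)   # 16x16 input -> 8x8 -> 4x4

    def forward(self, x):
        x = self.maxpool1(torch.relu(self.cnn1(x)))   # conv -> activation map -> pooling
        x = self.maxpool2(torch.relu(self.cnn2(x)))
        x = x.view(x.size(0), -1)                     # flatten
        return self.fc1(x)

model = CNN()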
GPU in PyTorch
We first need to check if CUDA is available:
torch.cuda.is_available()
We can check the current device and the number of available GPUs:
torch.cuda.current_device()
torch.cuda.device_count()
We define our GPU device so that we can send our tensors to it:
device = torch.device("cuda:0")
# or
device = torch.device(0)
The following two methods can send tensors to GPU:
torch.tensor([1, 2, 3, 4]).to(device)
# or
torch.tensor([1, 2, 3, 4], device="cuda:0")
Creating the CNN on the GPU: almost everything remains the same except for two things: 1) sending the model to the GPU, and 2) sending the training data to the GPU.
1) Sending a model to GPU
model = CNN().to(device) // it sends the model to the GPU
2) Sending training data to GPU
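A minimal sketch (inside the usual training loop; names are illustrative):
for x, y in trainloader:
    x, y = x.to(device), y.to(device)   # move the mini-batch to the GPU
    optimizer.zero_grad()
    z = model(x)
    loss = criterion(z, y)
    loss.backward()
    optimizer.step()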
You don't have to send the testing data to GPU since you don't have to calculate the derivatives and run the weight update.
Note that if the model uses a recurrent network and it's stateful, then the loss function should be on the GPU (reference).
criterion = nn.MSELoss().to(device)
9.5 Torch-Vision Models
Torch-Vision models are popular models developed by experts; we can use these pre-trained models to better classify our own images.
Example
Let's use a pre-trained ResNet-18. One thing to note is that we have to check the expected input format for the model. In the case of ResNet-18, the image size has to be $(3 \times H \times W)$ where $H \ge 224$ and $W \ge 224$, and the images have to be loaded into a range of $[0, 1]$ and then normalized using $mean = [0.485, 0.456, 0.406]$ and $std = [0.229, 0.224, 0.225]$ (check here).
Let's import the ResNet18 and take a look at the layers:
We'll freeze all the layers and overwrite the last layer with our custom FC layer:
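A minimal sketch of these two steps (n_classes is an assumed name; resnet18's final layer has 512 input features):
import torch.nn as nn
import torchvision.models as models

model = models.resnet18(pretrained=True)
for param in model.parameters():
    param.requires_grad = False          # freeze all pretrained layers

model.fc = nn.Linear(512, n_classes)     # new output layer; its parameters are trainable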
Note that when a layer is created, .requires_grad is True by default.
When building an optimizer, we provide only the parameters that need to be trained:
optimizer = torch.optim.Adam([param for param in model.parameters() if param.requires_grad], lr=1e-3)
The overall training process is the same. One thing to note is setting model.train() and model.eval(), since there are BN layers.
How to Delete and Add a Layer in a (Pretrained) Model (reference: here)
Let's say you have a pretrained ResNet18:
import torchvision.models as models
resnet = models.resnet18(pretrained=True)
We can create a generator over the layers in resnet by using .children():
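For example:
layers = list(resnet.children())   # .children() yields the top-level submodules
print(layers[-1])                  # the last layer: Linear(in_features=512, out_features=1000)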
Then, we can create a new Sequential model with all the existing layers except the last one (this is how we can delete a layer from an existing model):
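A sketch of that step:
import torch.nn as nn

new_model = nn.Sequential(*list(resnet.children())[:-1])   # everything except the final FC layer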
We can add a new FC layer as follows:
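One way to do it (n_classes is an assumed name; nn.Flatten handles the (N, 512, 1, 1) pooling output):
new_model = nn.Sequential(new_model, nn.Flatten(), nn.Linear(512, n_classes))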
If you want to save your memory, you can do all the steps above in one line:
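For instance:
new_model = nn.Sequential(*list(models.resnet18(pretrained=True).children())[:-1], nn.Flatten(), nn.Linear(512, n_classes))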