Deep Neural Networks with PyTorch, COURSERA
1.1 Tensors 1D
a = torch.tensor([0,1,2,3,4])
a.dtype
->torch.int64
a.type()
->torch.LongTensor
a = torch.tensor([0., 1., 2., 3., 4.], dtype=torch.int32)
a.dtype
->torch.int32
a = torch.FloatTensor([0, 1, 2, 3, 4])
->tensor([0., 1., 2., 3., 4.])
a = a.type(torch.FloatTensor)
a.size()
->torch.Size([5])
a.ndimension()
-> 1
a_col = a.view(5, 1) // same as a.view(-1, 1)
// .view(..) is the same as .reshape(..)
numpy_array = np.array([0., 1., 2., 3., 4.])
torch_tensor = torch.from_numpy(numpy_array)
// .from_numpy(..) converts a numpy array to a pytorch tensor
back_to_numpy = torch_tensor.numpy()
// .numpy() converts a pytorch tensor to a numpy array
torch_to_list = torch_tensor.tolist()
-> [0., 1., 2., 3., 4.] // .tolist() converts a pytorch tensor to a list
torch_tensor[0]
->tensor(0.)
torch_tensor[1]
->tensor(1.)
torch_tensor[0].item()
-> 0. // .item() returns a plain Python (int/float) value
torch_tensor[1].item()
-> 1.
torch_tensor[0] = 100 // change a value in the tensor
torch_tensor[1:4] // slicing works the same way as with other arrays
Basic Operations
u = torch.tensor([1., 0.])
v = torch.tensor([0., 1.])
z = u+v
->tensor([1., 1.])
y = torch.tensor([1, 2])
z = 2 * y
->tensor([2, 4])
u = torch.tensor([1, 2])
v = torch.tensor([3, 2])
z = u*v
->tensor([3, 4])
u = torch.tensor([1, 2])
v = torch.tensor([3, 1])
z = torch.dot(u, v)
-> 5
u = torch.tensor([1, 2, 3, -1])
z = u+1
->tensor([2, 3, 4, 0])
// broadcasting
Universal Functions
a = torch.tensor([1., -1., 1., -1.])
mean_a = a.mean()
// .mean() computes the average (note: it requires a floating-point tensor)
b = torch.tensor([1, -2, 3, 4, 5])
max_b = b.max()
// .max() returns the largest element
x = torch.tensor([0, np.pi/2, np.pi])
y = torch.sin(x)
->tensor([sin(0), sin(pi/2), sin(pi)])
=tensor([0, 1, 0])
torch.linspace(-2, 2, steps=5)
->tensor([-2., -1., 0., 1., 2.])
1.2 Two-Dimensional Tensors
a = [[11, 12, 13], [21, 22, 23], [31, 32, 33]]
A = torch.tensor(a)
-> 3x3 matrix
A.ndimension()
-> 2
A.shape
->torch.Size([3, 3])
A.numel()
-> 9 // number of elements
Indexing, slicing, and basic operations (e.g. add, subtract, multiply, divide) work the same way as with np.array.
A = torch.tensor([[0,1,1], [1,0,1]])
B = torch.tensor([[1,1], [1,1], [-1,1]])
C = torch.mm(A,B)
// mm denotes matrix multiplication.
Note the difference between the dot product and the matrix multiplication: the dot product is defined between two vectors, while the matrix product is defined between two matrices.
1.3 Derivatives in PyTorch
Let's see the following example: We're trying to compute a derivative of $y$ w.r.t $x$:
$$y(x) = x^2$$
If $x=2$, then
$$ y(2) = 2^2 = 4 $$
$$ \frac{dy(x)}{dx} = 2x $$
$$ \frac{dy(2)}{dx} = 2(2) = 4 $$
Let's see how we can compute them with pytorch:
x = torch.tensor(2., requires_grad=True)
// we want to evaluate the derivative of a function $y$ w.r.t $x$ at $x=2$; requires_grad=True allows the derivative to be computed.
y = x**2
-> $2.^2 = 4.$
y.backward()
// backward() calculates the derivative of $y$.
x.grad
// evaluates the derivative of $y$ at $x=2$, i.e. $\frac{dy(2)}{dx} = 4$. (Note: grad is an attribute, not a method.)
u = torch.tensor(1., requires_grad=True)
v = torch.tensor(2., requires_grad=True)
f = u*v + u**2
// $f(u, v) = uv + u^2$
f.backward()
// compute the derivatives
u.grad
// partial derivative of $f$ w.r.t $u$ $= \frac{\partial f(u,v)}{\partial u} = v + 2u = 4$
v.grad
// partial derivative of $f$ w.r.t $v$ $= \frac{\partial f(u,v)}{\partial v} = u = 1$
1.4 Simple Dataset
Build a Data Set Class and Object
To create our own dataset class, we must import the abstract class Dataset:
from torch.utils.data import Dataset
__getitem__ allows an instance of the class to be indexed and sliced like a list or an array (refer to here).
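The course's exact example isn't reproduced in these notes; a minimal sketch of such a class (toy data, illustrative names) could look like:
import torch
from torch.utils.data import Dataset

class ToyDataset(Dataset):
    def __init__(self, length=100, transform=None):
        self.x = 2 * torch.ones(length, 2)   # toy features
        self.y = torch.ones(length, 1)       # toy targets
        self.len = length
        self.transform = transform

    def __getitem__(self, index):
        sample = self.x[index], self.y[index]
        if self.transform:
            sample = self.transform(sample)  # apply the transform if one was given
        return sample

    def __len__(self):
        return self.len

data_set = ToyDataset()
x0, y0 = data_set[0]   # indexing works because of __getitem__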
Transform
Almost every dataset needs to be transformed (e.g. normalizing). We usually build a class to handle it. An example is shown below:
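The course's transform code isn't in these notes; assuming a sample is an (x, y) pair, a minimal sketch of such a class (named add_mult, since that name appears in the Compose example below) might be:
class add_mult(object):
    def __init__(self, addx=1, muly=2):
        self.addx = addx
        self.muly = muly

    def __call__(self, sample):
        x, y = sample
        x = x + self.addx   # add a constant to x
        y = y * self.muly   # multiply y by a constant
        return x, y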
__call__ allows an instance of the class to be callable (refer to here).
Transforms Compose
In many cases, we would like to run several transforms in series. For that, we simply use:
from torchvision.transforms import Compose
Let's say you have two transformation classes:
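Their definitions aren't in these notes; together with the add_mult sketch above, a second illustrative transform (matching the mult() used below, behavior assumed) could be:
class mult(object):
    def __init__(self, mult=100):
        self.mult = mult

    def __call__(self, sample):
        x, y = sample
        return x * self.mult, y * self.mult   # scale both x and y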
Then, we define:
from torchvision import transforms
data_transform = transforms.Compose([add_mult(), mult()])
x_, y_ = data_transform(data_set[0])
where data_transform takes an input, applies add_mult() first and then mult(), and outputs the transformed data.
1.5 Dataset
Dataset Class (for a large dataset)
The process of building a dataset of images is similar to the one above, but we usually do not load all the images into memory; each image is read only when it is requested. The same technique also applies to a large (numerical) dataset.
Further information about taking out a mini-batch is in Section 6.3.1.
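The course's class isn't reproduced here; a minimal sketch that loads each image lazily from disk (csv_file and data_dir as in the example below; the CSV column layout is an assumption) could look like:
import os
import pandas as pd
from PIL import Image
from torch.utils.data import Dataset

class ImageDataset(Dataset):
    def __init__(self, csv_file, data_dir, transform=None):
        self.data_dir = data_dir
        self.data = pd.read_csv(os.path.join(data_dir, csv_file))
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # the image file is opened only when this sample is requested
        image = Image.open(os.path.join(self.data_dir, self.data.iloc[idx, 1]))
        y = self.data.iloc[idx, 0]
        if self.transform:
            image = self.transform(image)
        return image, y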
Torch Vision Transforms
Torch Vision has a set of widely used pre-built transforms for images. Here's an example:
import torchvision.transforms as transforms
transforms.CenterCrop(20) // crop the edges of an image
transforms.ToTensor() // convert to a tensor
croptensor_data_transform = transforms.Compose([transforms.CenterCrop(20), transforms.ToTensor()])
dataset = Dataset(csv_file, data_dir, transforms=croptensor_data_transform)
dataset[0][0].shape // -> torch.Size([1, 20, 20])
Torch Vision Datasets
Torch Vision also has pre-built datasets that are commonly used to compare models.
import torchvision.datasets as dsets
dataset = dsets.MNIST(root="./data", download=True, transform=transforms.ToTensor())
Linear Regression in 1D - Prediction
Here, we model a linear regression as:
$$ y = b + wx $$
Assuming that we know $b=2$ and $w=-1$, the pytorch implementation is:
b = torch.tensor(2., requires_grad=True)   # requires_grad needs a floating-point tensor
w = torch.tensor(-1., requires_grad=True)
def forward(x):
    "defines a prediction process"
    y = b + w*x
    return y
x = torch.arange(0, 5).view(-1, 1)
yhat = forward(x) // -> [2, 1, 0, -1, -2]^T
PyTorch also provides built-in classes for linear computation, such as torch.nn.Linear.
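For instance, a prediction with the built-in Linear class (its weight and bias are initialized randomly) looks like this:
import torch
from torch.nn import Linear

model = Linear(in_features=1, out_features=1)
x = torch.tensor([[1.0], [2.0]])
yhat = model(x)                    # predictions for two samples
print(list(model.parameters()))    # the weight and the bias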
Custom Modules
Using model.state_dict(), you can access and modify the parameter values:
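The course's custom module isn't reproduced in these notes; a minimal sketch (the class name LR matches its later use in Section 3.3) could be:
import torch
import torch.nn as nn

class LR(nn.Module):
    def __init__(self, input_size, output_size):
        super(LR, self).__init__()
        self.linear = nn.Linear(input_size, output_size)

    def forward(self, x):
        return self.linear(x)

model = LR(1, 1)
print(model.state_dict())   # OrderedDict with 'linear.weight' and 'linear.bias'
model.state_dict()['linear.weight'].data[0] = torch.tensor([0.3])   # modify a parameter in place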
2.4 PyTorch Slope
Linear Regression PyTorch
3.1 Stochastic Gradient Descent and the Data Loader
A procedure of the stochastic gradient descent is:
- For $epoch \in \{0, 1, \cdots\}$:
  - For every sample:
    - Update the parameters
The training is usually unstable because the variability of a single sample (sample size 1) is too large.
Stochastic Gradient Descent DataLoader
Note that DataLoader wraps the instance of Data (the dataset class) to extend its functionality. If batch_size is larger than 1, it outputs a mini-batch.
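A minimal sketch (names are illustrative; the manual update assumes the w, b, and forward from Section 2, and Data is the custom dataset class mentioned above):
import torch
from torch.utils.data import DataLoader

dataset = Data()                                          # custom Dataset instance
trainloader = DataLoader(dataset=dataset, batch_size=1)   # batch_size=1 -> one sample at a time

lr = 0.1
for epoch in range(4):
    for x, y in trainloader:
        yhat = forward(x)
        loss = torch.mean((yhat - y) ** 2)        # MSE for this single sample
        loss.backward()
        w.data = w.data - lr * w.grad.data        # manual parameter update
        b.data = b.data - lr * b.grad.data
        w.grad.data.zero_()
        b.grad.data.zero_()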
3.2 Mini-Batch Gradient Descent
A procedure of the mini-batch gradient descent is:
- For $epoch \in \{0, 1, \cdots\}$:
  - For every mini-batch:
    - Update the parameters
If the mini-batch size is the same as the entire sample size, this is just (batch) gradient descent.
In PyTorch, you can implement the mini-batch iteration with
trainloader = DataLoader(dataset, batch_size=2**5)
3.3 Optimization in PyTorch
MSE loss function can be defined by using:
criterion = nn.MSELoss()
Various optimizers are defined in torch.optim. For example:
from torch import optim
model = LR(1, 1)  # linear regression model
optimizer = optim.SGD(model.parameters(), lr=0.01)
Similar to the model, the optimizer has .state_dict().
Using the DataLoader, nn.MSELoss, and optim.SGD, the training code can be written as:
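A minimal sketch of that training code (dataset and epochs are assumed names; LR is the custom module from Section 2):
from torch import nn, optim
from torch.utils.data import DataLoader

model = LR(1, 1)
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)
trainloader = DataLoader(dataset=dataset, batch_size=1)

for epoch in range(epochs):
    for x, y in trainloader:
        yhat = model(x)
        loss = criterion(yhat, y)
        optimizer.zero_grad()   # clear the gradients from the previous step
        loss.backward()         # compute the gradients
        optimizer.step()        # update the weights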
4.1 Multiple Linear Regression Prediction
Just as a reminder, once we compute a loss function loss, $J$, and run loss.backward(), the partial derivatives of $J$ w.r.t. the weights are computed:
$$ \frac{\partial J(b, w_1, w_2)}{\partial b}, \quad \frac{\partial J(b, w_1, w_2)}{\partial w_1}, \quad \frac{\partial J(b, w_1, w_2)}{\partial w_2} $$
and they can be used to update the weights (= optimizer.step()).
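For example, a prediction with two input features (a sketch with illustrative values; nn.Linear initializes the weights randomly):
import torch
import torch.nn as nn

model = nn.Linear(in_features=2, out_features=1)
x = torch.tensor([[1.0, 2.0]])                           # one sample with 2 features
yhat = model(x)                                          # shape: (1, 1)
X = torch.tensor([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])   # three samples
yhat = model(X)                                          # shape: (3, 1)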
4.2 Multiple Output Linear Regression
- #output = 2
model = Linear(input_size=2, output_size=2)
- #output = 3
model = Linear(input_size=2, output_size=3)
The training of the model is basically the same as Section 3.3.
5.0 Linear Classifier and Logistic Regression
A logistic regression model can be defined as:
$$ \hat{y} = \frac{1}{1 + \exp( - z )} $$
$$ z = b + \sum_{i=1}^p w_i x_i $$
where $p$ is the number of features/predictors. The logistic function is basically the same as the sigmoid function.
import torch.nn as nn
z = torch.arange(-10., 10., 0.1).view(-1, 1)
# Approach #1
sig = nn.Sigmoid()
yhat = sig(z)
# Approach #2
yhat = torch.sigmoid(z)
nn.Sequential
nn.Sequential is a really fast way to build a model. Let's build a logistic regression model using nn.Sequential:
model = nn.Sequential(nn.Linear(1, 1), nn.Sigmoid())
yhat = model(x)
Custom Module
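The course's custom module isn't in these notes; a minimal sketch of a logistic regression module (illustrative names) could be:
import torch
import torch.nn as nn

class LogisticRegression(nn.Module):
    def __init__(self, n_inputs):
        super(LogisticRegression, self).__init__()
        self.linear = nn.Linear(n_inputs, 1)

    def forward(self, x):
        # z = b + w^T x, squashed by the sigmoid
        return torch.sigmoid(self.linear(x))

model = LogisticRegression(n_inputs=1)
yhat = model(torch.tensor([[1.0]]))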
Binary Cross-Entropy Error - Loss Function
For binary classification models, a binary cross-entropy loss function is commonly used. In PyTorch, you can use:
criterion = nn.BCELoss()
FYI, the cross-entropy loss is defined as:
criterion = nn.CrossEntropyLoss()
6.2 Softmax Function
Choosing the maximum output for each record using .max(dim=..):
You would need this functionality to make a prediction for classification from the softmax output.
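For instance, assuming z holds the model outputs for a batch (shape: batch size x number of classes):
_, yhat = torch.max(z.data, dim=1)   # yhat is the index of the largest output (the predicted class) per record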
Code for training a classification model is as follows:
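The code itself isn't in these notes; a minimal sketch of such a training loop (input_dim, n_classes, dataset, and epochs are assumed names) could be:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

model = nn.Linear(input_dim, n_classes)     # the softmax is folded into the loss below
criterion = nn.CrossEntropyLoss()           # applies log-softmax + negative log-likelihood internally
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
trainloader = DataLoader(dataset=dataset, batch_size=100)

for epoch in range(epochs):
    for x, y in trainloader:
        optimizer.zero_grad()
        z = model(x.view(-1, input_dim))    # flatten each input into a vector
        loss = criterion(z, y)
        loss.backward()
        optimizer.step()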
6.3 Softmax PyTorch
Weight Visualization for MNIST
Let's say, we are building an MNIST classification model:
$$\boldsymbol{Y} = \boldsymbol{X} \boldsymbol{W}$$
where $\boldsymbol{Y} \in \mathbb{R}^{1 \times 10}$, $\boldsymbol{X} \in \mathbb{R}^{1 \times 784}$, and $\boldsymbol{W} \in \mathbb{R}^{784 \times 10}$. Note that 784 is the number of pixels (28 x 28) and 10 is the number of $y$ labels. Let $i$ and $j$ be a row index and a column index, respectively. Then, each column $\boldsymbol{W}_j$ acts like a filter for the input. In fact, we can think of this as a set of multiple linear regression models:
$$ \boldsymbol{Y}_j = \boldsymbol{X} \boldsymbol{W}_j $$
Then, each $\boldsymbol{W}_j$ can be visualized after being reshaped to (28 x 28). The visualization of $\boldsymbol{W}_j$ is:
6.3.1 Dataset Class - Part2
In Section 1.5, we built a custom dataset class using the abstract class Dataset from torch.utils.data. Its structure allowed us to take data one by one so that our memory won't be full. However, we did not learn how to take data by mini-batch. Here's how we can do it.
From the previous code in Section 1.5, some things are added, such as shuffle_dataname and Transform.
Up to here, we can only take data one by one. What makes mini-batch sampling possible is DataLoader from torch.utils.data:
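A minimal sketch (names are illustrative):
from torch.utils.data import DataLoader

trainloader = DataLoader(dataset=dataset, batch_size=100, shuffle=True)
for x, y in trainloader:
    print(x.shape)   # e.g. torch.Size([100, ...]) -- one mini-batch of 100 samples
    break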
The image data's directory structure looks like:
Remember that this approach can be used not only for large image datasets but also for large numerical datasets.
7.1 Neural Networks in One Dimension
Here's how we can build a simple neural network with one hidden layer of size 2:
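The course's class isn't reproduced in these notes; a minimal sketch (illustrative names) is:
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self, D_in, H, D_out):
        super(Net, self).__init__()
        self.linear1 = nn.Linear(D_in, H)    # input -> hidden
        self.linear2 = nn.Linear(H, D_out)   # hidden -> output

    def forward(self, x):
        x = torch.sigmoid(self.linear1(x))
        x = torch.sigmoid(self.linear2(x))
        return x

model = Net(1, 2, 1)   # hidden layer of size 2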
Using torch.nn.Sequential(..), we can build the same neural network as well:
model = nn.Sequential(nn.Linear(1, 2), nn.Sigmoid(), nn.Linear(2, 1), nn.Sigmoid())
7.2 Neural Networks More Hidden Neurons
The training process is generally the same as in the previous sections.
The Adam optimizer is defined as:
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
7.5 BackPropagation
Let's say we want to calculate $\frac{\partial a}{\partial x}$. However, we cannot calculate it directly because there is an intermediate variable $z$ between $a$ and $x$. Therefore, it expands to:
$$ \frac{\partial a}{\partial x} = \frac{\partial a}{\partial z} \frac{\partial z}{\partial x} $$
This is called the chain rule. The same principle applies to the backpropagation of neural networks.
7.6 Activation Functions
# Activation functions
torch.sigmoid(..)
torch.tanh(..)
torch.relu(..)
a = torch.relu(z)  # example
# Activation layers
sigmoid_layer = nn.Sigmoid()
tanh_layer = nn.Tanh()
relu_layer = nn.ReLU()
a = relu_layer(z)  # example
8.1 Deep Neural Networks
8.1.2 Deeper Neural Networks
The main key to building a very deep neural network is nn.ModuleList. It is just like a Python list, designed to store any desired number of nn.Module's. An example is shown below:
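The course's example isn't reproduced here; a minimal sketch (the Layers argument, e.g. [2, 50, 50, 3], is an assumed convention) might be:
import torch
import torch.nn as nn

class DeepNet(nn.Module):
    def __init__(self, Layers):
        super(DeepNet, self).__init__()
        self.hidden = nn.ModuleList()
        for input_size, output_size in zip(Layers, Layers[1:]):
            self.hidden.append(nn.Linear(input_size, output_size))

    def forward(self, x):
        for i, layer in enumerate(self.hidden):
            if i < len(self.hidden) - 1:
                x = torch.relu(layer(x))   # hidden layers
            else:
                x = layer(x)               # output layer (no activation here)
        return x

model = DeepNet(Layers=[2, 50, 50, 3])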
8.2 Dropout
self.dropout = nn.Dropout(p)
One thing you need to keep in mind is the training mode and the evaluation mode for the dropout.
- When you want to train the model, you set model.train().
- When you want to evaluate (test) the model, you set model.eval().
Pseudocode for training with dropout would look like this:
model.train()   # dropout = True
for epoch in range(epochs):
    ...
with torch.no_grad():
    model.eval()   # dropout = False
    ...
8.3 Neural Network Initialization Weights
linear = nn.Linear(in_size, out_size)
# Xavier uniform (an in-place function)
nn.init.xavier_uniform_(linear.weight)
# He (Kaiming) uniform; the default nonlinearity is 'leaky_relu'
nn.init.kaiming_uniform_(linear.weight, nonlinearity="relu")
8.5 Batch Normalization
Similar to nn.Dropout(..), we need to switch between model.train() and model.eval() for nn.BatchNorm1d(..). The training and evaluation would look like this:
model.train()   # BN for training; the statistics of the current mini-batch are used.
for epoch in range(epochs):
    ...
with torch.no_grad():
    model.eval()   # BN for evaluation; the running (population) estimates are used.
    ...
9.1 Convolution
In PyTorch, we can create a convolution layer as follows:
conv = nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding)
nn.Conv2d(..) expects the input to be of size $(N, C_{in}, H, W)$, where $N$ denotes the sample size, $C_{in}$ the number of channels, and $H$ and $W$ the height and width, respectively.
Padding
If we set padding=1, this will add one outer layer of zeros.
In PyTorch, there is no feature like padding="same" as in Tensorflow. Therefore, we have to calculate the padding size ourselves. The output size of nn.Conv2d(..)
is:$$ output\_size = floor \left( \frac{input\_size \; - filter\_size \; + \; (2 \: padding)}{stride} + 1 \right) \tag{1}$$
Then, we can rewrite this formula w.r.t padding:
$$ padding = floor \left( \frac{output\_size \times stride \; - \; input\_size \; + \; filter\_size \; - \; 1}{2} \right) \tag{2}$$
To achieve padding="same", we set output_size = input_size.
For example, suppose input_size = 4, filter_size = 2, and stride = 2, and we want output_size = input_size = 4.
By $Eq.(2)$, the padding size can be computed:
$$ padding = \frac{4 \times 2 - 4 + 2 - 1}{2} = 2.5 \rightarrow floor(2.5) = 2 $$
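A quick sanity check of this example (a sketch with the values assumed above):
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=2, stride=2, padding=2)
image = torch.zeros(1, 1, 4, 4)
print(conv(image).shape)   # -> torch.Size([1, 1, 4, 4]): the spatial size matches the input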
9.2 Activation Functions and Max Pooling
Activation Map
Max Pooling
maxpool = nn.MaxPool2d(kernel_size, stride)
x = maxpool(image)
# we can also apply it directly as a function:
x = torch.max_pool2d(image, kernel_size, stride)
- It adds non-linearity in addition to the activation map.
- It helps with reducing the network size
- It may extract important features
Example of Max Pooling
In terms of extracting important features, look at the following example:
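The original figure isn't in these notes; a small numerical sketch (toy values) shows that 2x2 max pooling keeps the largest value in each window:
import torch
import torch.nn as nn

image = torch.tensor([[1., 2., 3., -4.],
                      [0., 2., -3., 0.],
                      [4., 0., 1., 5.],
                      [0., 1., 0., 2.]]).view(1, 1, 4, 4)
maxpool = nn.MaxPool2d(kernel_size=2, stride=2)
print(maxpool(image))   # -> [[2., 3.], [4., 5.]] (the maximum of each 2x2 window)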
9.3 Multiple Input and Output Channels
Multiple output channels = Multiple kernels.
In Tensorflow, you set filters=.. to specify the number of kernels (= the number of output channels). In PyTorch, you directly set out_channels=...
The above example can be defined as:
conv1 = nn.Conv2d(in_channels=1, out_channels=3, kernel_size=3)
Shape of $\boldsymbol{W}$, $\boldsymbol{X}$, and $\boldsymbol{A}$
$\boldsymbol{W}: (C_{out}, C_{in}, K_H, K_W)$, $\quad \boldsymbol{X}: (N, C_{in}, H, W)$, $\quad \boldsymbol{A}: (N, C_{out}, H_{out}, W_{out})$,
where $H_{out}$ and $W_{out}$ differ depending on the padding. As for $(N, C_{out}, H_{out}, W_{out})$, if $N=1$, it is $(1, C_{out}, H_{out}, W_{out})$, which is $n (= C_{out})$ images filtered by $n$ kernels.
Multiple Input Channels
Multiple Input and Output Channels
conv4 = nn.Conv2d(in_channels=2, out_channels=3, kernel_size=3)
Visualizing the CNN's Operation
Watch this Youtube video
Visualizing Deep Layers
People sometimes visualize the image patches that most highly activate a certain kernel $j$, where $j \in \{1, \ldots, C_{out}\}$, to present what kind of features that particular kernel captures. An image patch refers to the region of the image that a kernel can see. In the earlier layers, kernels see a small patch, and in the later layers, kernels see a large patch.
FAQ regarding this:
Q. I can't understand how these visualizations can be real if images are progressively being convoluted and have less and less resolution throughout the conv-net.
A. You go backward to find the patch in the original image that is giving the maximum response at a certain layer. (Source: Andrew Ng's course)
9.4 Convolutional Neural Network
Simple CNN
Let's build a CNN with two convolutional layers and an output layer (a sketch follows the list below).
- The first convolutional layer has two output channels; $C_{in}=1, \; C_{out} = 2$.
- -> Activation map -> Pooling layer ->
- The second convolutional layer has two input channels and one output channel; $C_{in}=2, \; C_{out}=1$.
- -> Activation map -> Pooling layer -> Flatten ->
- Fully connected (FC) layers -> output
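The course's implementation isn't reproduced here; a minimal sketch under assumptions (kernel size 5, padding 2, and a 16x16 input image, which fixes the FC input size) could be:
import torch
import torch.nn as nn

class CNN(nn.Module):
    def __init__(self, out_1=2, out_2=1, n_classes=10):
        super(CNN, self).__init__()
        self.cnn1 = nn.Conv2d(in_channels=1, out_channels=out_1, kernel_size=5, padding=2)
        self.maxpool1 = nn.MaxPool2d(kernel_size=2)
        self.cnn2 = nn.Conv2d(in_channels=out_1, out_channels=out_2, kernel_size=5, padding=2)
        self.maxpool2 = nn.MaxPool2d(kernel_size=2)
        self.fc1 = nn.Linear(out_2 * 4 * 4, n_classes)   # 16x16 input -> 8x8 -> 4x4

    def forward(self, x):
        x = self.maxpool1(torch.relu(self.cnn1(x)))   # conv -> activation map -> pooling
        x = self.maxpool2(torch.relu(self.cnn2(x)))
        x = x.view(x.size(0), -1)                     # flatten
        return self.fc1(x)

model = CNN()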
GPU in PyTorch
We first need to check if CUDA is available:
torch.cuda.is_available()
We can check the current device and the number of available GPUs:
torch.cuda.current_device()
torch.cuda.device_count()
We define our GPU device so that we can send our tensors to it:
device = torch.device("cuda:0")
# or
device = torch.device(0)
The following two methods can send tensors to GPU:
torch.tensor([1, 2, 3, 4]).to(device)
# or
torch.tensor([1, 2, 3, 4], device="cuda:0")
Creating the CNN on the GPU: almost everything remains the same except for two things: 1) sending the model to the GPU, and 2) sending the training data to the GPU.
1) Sending a model to GPU
model = CNN().to(device) // it sends the model to the GPU
2) Sending training data to GPU
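A minimal sketch (inside the usual training loop; names are illustrative):
for x, y in trainloader:
    x, y = x.to(device), y.to(device)   # move the mini-batch to the GPU
    optimizer.zero_grad()
    z = model(x)
    loss = criterion(z, y)
    loss.backward()
    optimizer.step()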
You don't have to send the testing data to GPU since you don't have to calculate the derivatives and run the weight update.
Note that if the model uses a recurrent network and it's stateful, then the loss function should be on the GPU (reference).
criterion = nn.MSELoss().to(device)
9.5 Torch-Vision Models
Torch-Vision models are popular models developed by experts; we can use these pre-trained models to better classify our own images.
Example
Let's use a pre-trained ResNet-18. One thing to note is that we have to check the expected input format for the model. In the case of ResNet-18, the image size has to be $(3 \times H \times W)$ where $H \ge 224$ and $W \ge 224$, and the images have to be loaded into a range of $[0, 1]$ and then normalized using $mean = [0.485, 0.456, 0.406]$ and $std = [0.229, 0.224, 0.225]$ (check here).
Let's import the ResNet18 and take a look at the layers:
We'll freeze all the layers and overwrite the last layer with our custom FC layer:
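A minimal sketch of these two steps (n_classes is an assumed name; resnet18's final layer has 512 input features):
import torch.nn as nn
import torchvision.models as models

model = models.resnet18(pretrained=True)
for param in model.parameters():
    param.requires_grad = False          # freeze all pretrained layers

model.fc = nn.Linear(512, n_classes)     # new output layer; its parameters are trainable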
Note that when a layer is created, .requires_grad is True by default.
When building an optimizer, we provide only the parameters that need to be trained:
optimizer = torch.optim.Adam([param for param in model.parameters() if param.requires_grad], lr=1e-3)
The overall training process is the same. One thing to note is setting model.train() and model.eval(), since there are BN layers.
How to Delete and Add a Layer in a (Pretrained) Model (reference: here)
Let's say you have a pretrained ResNet18:
import torchvision.models as models
resnet = models.resnet18(pretrained=True)
We can create a generator over the layers in resnet by using .children():
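For example:
layers = list(resnet.children())   # .children() yields the top-level submodules
print(layers[-1])                  # the last layer: Linear(in_features=512, out_features=1000)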
Then, we can create a new Sequential model with all the existing layers except the last one (this is how we can delete a layer from an existing model):
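A sketch of that step:
import torch.nn as nn

new_model = nn.Sequential(*list(resnet.children())[:-1])   # everything except the final FC layer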
We can add a new FC layer as follows:
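One way to do it (n_classes is an assumed name; nn.Flatten handles the (N, 512, 1, 1) pooling output):
new_model = nn.Sequential(new_model, nn.Flatten(), nn.Linear(512, n_classes))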
If you want to save your memory, you can do all the steps above in one line:
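For instance:
new_model = nn.Sequential(*list(models.resnet18(pretrained=True).children())[:-1], nn.Flatten(), nn.Linear(512, n_classes))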