Self-Supervised Learning

Previous post

[Machine Learning/Unsupervised Learning] - Self-Supervised Learning - SimCLR (1)

Data augmentation for Contrastive representation learning

Image augmentation through various transformations is used widely across supervised, self-supervised, and semi-supervised learning, but applying it to contrastive learning previously required a specialized architecture or an extra procedure. SimCLR captures different views of an augmentation through random cropping alone, with no special architecture. As the figure below shows, random cropping naturally yields contrastive prediction tasks between global and local views and between adjacent views.

Composition of data augmentation operation is crucial for learning good representations

Transformations used for augmentation fall into two broad groups: 1) spatial or geometric transformations such as cropping/resizing, flipping, rotation, and cutout, and 2) appearance transformations such as color distortion, Gaussian blur, and Sobel filtering.

To measure the effect of each augmentation, the paper runs contrastive learning with each augmentation applied alone or in composition (aug 1 followed by aug 2), then reports ImageNet accuracy under linear evaluation.

ImageNet images come in different sizes, so cropping/resizing must always be applied to produce fixed-size inputs. That makes it hard to isolate the effect of the other transformations. To work around this, cropping/resizing is applied to both of the two views generated from one image, and the transformation under study is applied to only one branch. The linear evaluation results for each augmentation are as follows.
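The two-branch protocol described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the paper's pipeline: the helper names `random_crop_resize` and `make_pair`, the square crop, the crop-size range, and the nearest-neighbour resize are all assumptions made to keep the example short.

```python
import numpy as np

def random_crop_resize(img, out_size, rng):
    """Crop a random square patch and resize it with nearest-neighbour sampling."""
    h, w = img.shape[:2]
    # Hypothetical crop-size range: between half and all of the shorter side.
    side = int(rng.integers(min(h, w) // 2, min(h, w) + 1))
    top = int(rng.integers(0, h - side + 1))
    left = int(rng.integers(0, w - side + 1))
    patch = img[top:top + side, left:left + side]
    idx = np.arange(out_size) * side // out_size  # nearest-neighbour index map
    return patch[idx][:, idx]

def make_pair(img, target_transform, out_size, rng):
    """Both branches get crop/resize; only branch A gets the transformation under study."""
    view_a = target_transform(random_crop_resize(img, out_size, rng))
    view_b = random_crop_resize(img, out_size, rng)
    return view_a, view_b

rng = np.random.default_rng(0)
image = rng.random((48, 64, 3))          # toy image with values in [0, 1)
horizontal_flip = lambda x: x[:, ::-1]   # stand-in for the transformation being tested
view_a, view_b = make_pair(image, horizontal_flip, 32, rng)
```

Both views come out at the same fixed size, so any accuracy difference between runs can be attributed to `target_transform` rather than to the resizing step.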

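In Figure 5, the diagonal entries (single transformations) score lower than the off-diagonal entries (compositions of augmentations). In other words, a single transformation makes it easier to pick out the positive pair, but it is not enough to learn a good representation.

The combination that stands out most in Figure 5 is random cropping with random color distortion. With cropping alone, patches taken from the same image tend to share a similar color distribution, so the contrastive task can be solved from color statistics. As Figure 6 below shows, color distortion makes the color distributions of the crops differ substantially, so contrastive prediction is no longer trivial and the model is pushed to learn more general features.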
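To make the color-distribution argument concrete, here is a toy check that compares intensity histograms of two crops from one image, before and after one crop is distorted. The image, the scaling factor of 2.5, and the helper names are made up for illustration; real images and the real color distortion are far richer.

```python
import numpy as np

def intensity_histogram(img, bins=16):
    """Normalized histogram of pixel intensities for a float image in [0, 1]."""
    hist, _ = np.histogram(img, bins=bins, range=(0.0, 1.0))
    return hist / hist.sum()

def histogram_distance(a, b):
    """L1 distance between two intensity histograms (0 = identical, 2 = disjoint)."""
    return float(np.abs(intensity_histogram(a) - intensity_histogram(b)).sum())

rng = np.random.default_rng(0)
image = 0.2 + 0.05 * rng.random((64, 64, 3))    # toy image with a narrow color range
crop_a = image[:32, :32]
crop_b = image[32:, 32:]
distorted_b = np.clip(crop_b * 2.5, 0.0, 1.0)   # crude stand-in for color distortion

d_plain = histogram_distance(crop_a, crop_b)
d_distorted = histogram_distance(crop_a, distorted_b)
print(d_plain, d_distorted)
```

In this toy setup the undistorted crops are nearly indistinguishable by histogram, which is exactly the shortcut a network could exploit, while the distorted crop no longer matches.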

Contrastive learning needs stronger data augmentation than supervised learning

To gauge how much color augmentation matters, the paper varies the parameters of the color distortion. Table 1 compares SimCLR accuracy at each strength of cropping/resizing + color distortion against a supervised model trained with the same augmentation. SimCLR accuracy rises as the color distortion gets stronger, while the supervised model shows little change across strengths. The conclusion is that unsupervised contrastive learning benefits from stronger data augmentation than supervised learning does.
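A strength-controlled color distortion might look like the sketch below. It is deliberately simplified to brightness and contrast jitter only and is not the paper's recipe; the function name `color_distort` and the way `strength` sets the jitter range are assumptions for illustration.

```python
import numpy as np

def color_distort(img, strength, rng):
    """Simplified color distortion on a float image in [0, 1].

    `strength` widens the range from which the jitter factors are drawn.
    """
    low = max(0.0, 1.0 - strength)
    high = 1.0 + strength
    # Brightness: scale all pixel values by a random factor.
    out = img * rng.uniform(low, high)
    # Contrast: scale each pixel's distance from the mean intensity.
    mean = out.mean()
    out = (out - mean) * rng.uniform(low, high) + mean
    return np.clip(out, 0.0, 1.0)

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
for s in (0.0, 0.5, 1.0):
    distorted = color_distort(image, s, rng)
    print(s, float(np.abs(distorted - image).mean()))
```

With `strength=0.0` the image passes through unchanged, so sweeping `strength` upward gives a simple knob for the kind of comparison Table 1 reports.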

Architectures for encoder and head

Unsupervised contrastive learning benefits from bigger models

In Figure 7, unsurprisingly, linear evaluation accuracy improves as the model grows deeper and wider. The notable point is that the gap between supervised and unsupervised linear evaluation shrinks as the model gets larger, which suggests that unsupervised learning gains more from bigger models than supervised learning does.

A nonlinear projection head improves the representation quality of the layer before it

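This part examines what happens when the contrastive loss is applied after a projection head $g$ instead of directly on the output of the base encoder $f$. Figure 8 shows accuracy for different forms and output dimensions of $g$: a non-linear projection works better than a linear projection or no projection, regardless of the output dimension.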
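As a rough picture of what a non-linear head looks like, here is a sketch assuming a one-hidden-layer MLP with ReLU, $z = g(h) = W_2\,\mathrm{ReLU}(W_1 h)$. The dimensions (512 in, 128 out) and the random weights are placeholders chosen for the example, not values taken from the post.

```python
import numpy as np

def projection_head(h, w1, w2):
    """Non-linear head z = ReLU(h W1) W2, applied row-wise to a batch of h."""
    return np.maximum(h @ w1, 0.0) @ w2

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 512))               # encoder outputs; 512 is an assumed width
w1 = rng.normal(size=(512, 512)) * 0.05     # hidden layer weights (random placeholder)
w2 = rng.normal(size=(512, 128)) * 0.05     # output layer weights; 128 is an assumed size
z = projection_head(h, w1, w2)              # the contrastive loss is computed on z
print(z.shape)
```

The contrastive loss sees only `z`; downstream tasks use `h`, which is why the quality of the layer before the head is what matters.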

Linear evaluation on $h$ gives better results than on $z = g(h)$, which indicates that the hidden layer before the projection head holds the better representation. The conjectured reason is that the contrastive loss trains $z = g(h)$ to be invariant to data transformations, so $g$ can discard information that is useful downstream. Put differently, because $g$ absorbs that invariance, $h$ is free to keep the information. Table 3 reports an experiment testing this conjecture and shows that $h$ carries more information than $g(h)$ about which transformation was applied.

In addition, Figure B.3 shows the eigenvalue distribution of $W$ in $z=Wh$, where most eigenvalues are very small. That is, $W$ is approximately low-rank, so much of the information in $h$ is lost when it is projected. Figure B.4 is a t-SNE visualization of 10 randomly chosen classes, in which $h$ separates the classes more cleanly than $g(h)$.
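One way to quantify "approximately low-rank" is to count how many components are needed to capture most of the spectrum. The sketch below uses singular values instead of eigenvalues (a swap made here so it also works for non-square matrices), and the matrix is a synthetic stand-in, not a learned $W$. The 99% threshold is an arbitrary choice.

```python
import numpy as np

def effective_rank(w, energy=0.99):
    """Number of singular values needed to capture `energy` of the squared spectrum."""
    s = np.linalg.svd(w, compute_uv=False)          # sorted in descending order
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(cumulative, energy) + 1)

rng = np.random.default_rng(0)
# Toy stand-in for a learned W: a rank-8 signal plus small noise.
w = rng.normal(size=(64, 8)) @ rng.normal(size=(8, 64)) + 1e-3 * rng.normal(size=(64, 64))
print(effective_rank(w), "of", min(w.shape))
```

A small effective rank relative to the matrix size means the projection keeps only a few directions of $h$, which matches the information-loss reading of Figure B.3.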

Loss functions and batch size

Normalized cross entropy loss with adjustable temperature works better than alternatives

SimCLR's NT-Xent is compared against other losses. Unlike NT-Xent, the other losses do not take the relative hardness of negative samples into account. They therefore need semi-hard negative mining, that is, separately picking out negatives that are similar to the positive and hard to tell apart, and applying the loss to those. Even with semi-hard negative mining, they fall short of NT-Xent, which weights negatives by their hardness during training.

Table 5 shows the effect of the $l_2$ normalization used in cosine similarity and of the temperature parameter $\tau$. Without $l_2$ normalization, accuracy on the contrastive prediction task is higher, but the resulting representation is worse.

Contrastive learning benefits from larger batch sizes and longer training
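For reference, the per-pair NT-Xent loss can be written as $\ell_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k \neq i} \exp(\mathrm{sim}(z_i, z_k)/\tau)}$ with cosine similarity as $\mathrm{sim}$. Below is a minimal NumPy sketch with switches for the two knobs discussed above. The batch layout (rows $2k$ and $2k+1$ form a positive pair) and the default `temperature=0.5` are assumptions of this example.

```python
import numpy as np

def nt_xent(z, temperature=0.5, normalize=True):
    """NT-Xent over 2N embeddings where rows 2k and 2k+1 form a positive pair."""
    z = np.asarray(z, dtype=float)
    if normalize:
        z = z / np.linalg.norm(z, axis=1, keepdims=True)   # l2 norm -> cosine similarity
    logits = z @ z.T / temperature
    np.fill_diagonal(logits, -np.inf)                      # exclude k == i from the sum
    positives = np.arange(len(z)) ^ 1                      # partner index: 0<->1, 2<->3, ...
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(len(z)), positives].mean())

# Two perfectly aligned pairs that are orthogonal to each other.
z = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
print(nt_xent(z, temperature=0.5), nt_xent(z, temperature=0.1))
```

Because the denominator is a softmax over all other examples, negatives with high similarity to the anchor contribute more to the gradient, which is the hardness weighting that the other losses lack. Setting `normalize=False` drops the $l_2$ normalization for comparison.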

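Figure 9 shows accuracy as a function of batch size and number of epochs. When training is short (around 100 epochs), a large batch size is decisive for accuracy, and the gap between batch sizes narrows as training gets longer. This is because, in contrastive learning, a larger batch supplies more negative examples, which makes convergence easier; training for longer also exposes the model to more negatives, so it partly compensates for a small batch.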
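The link between batch size and negatives is simple arithmetic: with $N$ images per batch and two views each, every anchor sees the other $2(N-1)$ views as negatives. The batch sizes below are illustrative values, and the helper name is made up for this example.

```python
def negatives_per_anchor(batch_size):
    """Each of the 2N views treats the other 2(N - 1) views (excluding its partner) as negatives."""
    return 2 * (batch_size - 1)

for n in (256, 1024, 4096):
    print(n, negatives_per_anchor(n))
```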

Experiments

Linear evaluation

Table 6 reports linear evaluation results for representations learned with contrastive learning. The experiments use ResNet-50 with the channel width scaled by 1x, 2x, and 4x. SimCLR outperforms the earlier methods it is compared against.
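Linear evaluation means freezing the encoder and training only a linear classifier on its features. Here is a small sketch of that idea, assuming a softmax classifier trained by plain gradient descent; the 2-D Gaussian clusters stand in for frozen features $h$, and the learning rate and step count are arbitrary.

```python
import numpy as np

def linear_eval(features, labels, num_classes, lr=0.1, steps=200):
    """Fit a softmax classifier on frozen features with plain gradient descent."""
    n, d = features.shape
    w = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = features @ w + b
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n                   # gradient of mean cross-entropy
        w -= lr * features.T @ grad
        b -= lr * grad.sum(axis=0)
    return w, b

rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=300)
centers = np.array([[4.0, 0.0], [0.0, 4.0], [-4.0, -4.0]])
features = centers[labels] + rng.normal(size=(300, 2))   # toy stand-in for frozen h
w, b = linear_eval(features, labels, num_classes=3)
accuracy = float(((features @ w + b).argmax(axis=1) == labels).mean())
print(accuracy)
```

Only `w` and `b` are updated, so the resulting accuracy reflects how linearly separable the features already are, which is the property the tables measure.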

Semi-supervised learning

1% or 10% of the labeled ImageNet data is sampled in a class-balanced way, and the pretrained base network is fine-tuned on it to confirm the accuracy gain.
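Class-balanced sampling of a label subset could be done along these lines. The function name `class_balanced_subset` and the toy label array are hypothetical, and the rounding rule (at least one example per class) is an assumption.

```python
import numpy as np

def class_balanced_subset(labels, fraction, rng):
    """Return indices that keep `fraction` of the examples from every class."""
    chosen = []
    for cls in np.unique(labels):
        members = np.flatnonzero(labels == cls)
        keep = max(1, int(round(fraction * len(members))))   # at least one per class
        chosen.append(rng.choice(members, size=keep, replace=False))
    return np.concatenate(chosen)

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(10), 100)        # toy dataset: 10 classes x 100 examples
idx_10 = class_balanced_subset(labels, 0.10, rng)
idx_1 = class_balanced_subset(labels, 0.01, rng)
print(len(idx_10), len(idx_1))
```

Sampling per class keeps every class represented even at the 1% setting, which plain uniform sampling would not guarantee.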

Transfer learning

Finally, the self-supervised representation is applied to transfer learning to check the accuracy gain. Transfer performance is measured in two ways: 1) linear evaluation, where a linear classifier is trained on a new dataset on top of the representation learned on ImageNet, and 2) fine-tuning.

In Table 8, with fine-tuning, SimCLR outperforms the supervised baseline on 5 datasets.

Summary by 홍머스

Effect of the temperature parameter $\tau$...?
Self-supervision for data feature extraction...
Contrastive!

References

A Simple Framework for Contrastive Learning of Visual Representations


홍러닝

Self-Supervised Learning - SimCLR (2)