Masked Autoencoders Are Scalable Vision Learners (MAE)


illinaire 2025. 4. 18. 13:07

Masked Autoencoders Are Scalable Vision Learners (MAE): An In-Depth Paper Walkthrough

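Post summary: MAE by Kaiming He et al. (2022) is a self-supervised learning method built on the Vision Transformer (ViT). It randomly masks 75% of the input image patches and feeds only the remaining 25% to the encoder to learn latent representations. This walkthrough follows the structure of a paper (Abstract, Introduction, Related Work, Method, Experiments, Discussion, Conclusion) and covers the key equations, implementation details, hyperparameter settings, and experimental results.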

Abstract

This paper presents Masked Autoencoders (MAE), a simple and scalable self-supervised learning framework for vision. MAE randomly masks out 75% of image patches, feeds only the remaining visible patches through a ViT encoder, and reconstructs the masked patches with a lightweight ViT decoder. On ImageNet-1K, MAE reaches 83.6% top-1 accuracy with ViT-Base after fine-tuning, surpassing the supervised ViT-Base baseline. Because the encoder skips the masked patches, the paper reports that pre-training runs 3× or more faster.

1. Introduction

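Models based on the Vision Transformer (ViT) [1] perform strongly on image classification, detection, and segmentation, but they depend on large amounts of labeled data. Masked Autoencoders (MAE) [2] apply BERT-style masking to ViT, which enables strong representation learning without labels. In doing so, MAE sets a new reference point for self-supervised learning.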

2. Related Work

2.1 Contrastive Methods

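SimCLR [3]: pulls together two augmented views of the same image by raising their cosine similarity relative to the other images in the batch (an InfoNCE-style loss).
MoCo [4]: uses a momentum encoder and a queue to provide a large pool of negative samples.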

2.2 Reconstruction Methods

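Autoencoder: compresses the entire input and then reconstructs it; combined with ViT, processing every patch is inefficient.
BEiT [5]: masked image modeling that predicts discrete visual tokens.
MAE instead reconstructs the raw pixels of each patch (x_i), which is a simpler and more efficient approach.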

3. Method

3.1 Architecture Overview

MAE has an encoder-decoder structure. The decoder is used only during pre-training, which keeps the approach efficient.

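Patch Embedding & Masking: The image is split into \(16\times16\) patches, giving \(N\) patches. A fraction \(\alpha=75\%\) of them is masked by uniform random sampling, and only the remaining \(25\%\) are fed to the ViT encoder.
ViT Encoder: The visible patch embeddings, with positional embeddings added, pass through an \(L_E\)-layer Transformer encoder.
ViT Decoder: Mask tokens are inserted back into the encoder output at the masked positions, positional embeddings are added, and an \(L_D\)-layer Transformer decoder processes the full sequence of \(N\) patches.
Reconstruction Head: A linear layer maps each decoder output to the pixel values of its patch, and training uses an MSE loss.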
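The patch-splitting step can be sketched in plain Python. This is a minimal illustration, not the paper's code: the `patchify` helper, the nested-list image layout, and the 224×224 RGB input size are assumptions made for the example.

```python
def patchify(image, patch_size=16):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    patch.extend(image[row][col])  # append the C channel values
            patches.append(patch)
    return patches


# Dummy 224 x 224 RGB image filled with zeros (assumed input size).
image = [[[0.0, 0.0, 0.0] for _ in range(224)] for _ in range(224)]
patches = patchify(image)

num_patches = len(patches)                    # N = (224 / 16) ** 2
patch_dim = len(patches[0])                   # 16 * 16 * 3 values per patch
num_visible = int(num_patches * (1 - 0.75))   # patches the encoder receives
```

With these assumed sizes, the encoder receives 49 of the 196 patches, and the reconstruction head has to output 768 pixel values per patch.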

Figure 1. ViT-based encoder-decoder structure (source: Wikimedia Commons, public domain)

3.2 Masking Strategy

Masking uses uniform random sampling without replacement. The set of masked patch indices \(\mathcal{M}\) is

\(\displaystyle \mathcal{M} = \mathrm{UniformSample}(\{1,\dots,N\}, \alpha N)\)

The remaining visible set is \(\mathcal{V} = \{1,\dots,N\}\setminus\mathcal{M}\).
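One way to draw \(\mathcal{M}\) and \(\mathcal{V}\) is to shuffle the patch indices and cut the shuffled list at \(\alpha N\). The sketch below assumes 0-based indices (the text uses \(1,\dots,N\)), and the helper names `random_masking` and `build_decoder_input` are hypothetical. The second helper shows how mask tokens could be placed back at the masked positions before decoding.

```python
import random


def random_masking(num_patches, mask_ratio=0.75, seed=None):
    """Return (masked, visible) index lists sampled uniformly without replacement."""
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    shuffled = list(range(num_patches))
    rng.shuffle(shuffled)
    masked = sorted(shuffled[:num_masked])     # the set M
    visible = sorted(shuffled[num_masked:])    # the set V = all indices minus M
    return masked, visible


def build_decoder_input(visible_tokens, visible, num_patches, mask_token):
    """Place encoded visible tokens at their positions; fill the rest with mask_token."""
    sequence = [mask_token] * num_patches
    for token, index in zip(visible_tokens, visible):
        sequence[index] = token
    return sequence


masked, visible = random_masking(196, mask_ratio=0.75, seed=0)
encoded = [f"enc_{i}" for i in visible]        # stand-ins for encoder outputs
decoder_input = build_decoder_input(encoded, visible, 196, "[MASK]")
```

Strings stand in for token vectors here; a real model would use a learned mask embedding shared across all masked positions.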

3.3 Loss Function

The reconstruction loss is the MSE computed over the masked patches only:

\(\displaystyle \mathcal{L}_{\mathrm{MAE}} = \frac{1}{|\mathcal{M}|}\sum_{i\in\mathcal{M}} \|x_i - \hat{x}_i\|^2\)
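A direct transcription of this formula into Python may make the masking of the loss concrete. The function name `mae_loss` and the toy two-value patches are assumptions for illustration; visible patches contribute nothing, however badly they are reconstructed.

```python
def mae_loss(patches, reconstructions, masked):
    """Mean over masked patches of the squared L2 distance ||x_i - x_hat_i||^2."""
    total = 0.0
    for i in masked:
        total += sum((x - y) ** 2 for x, y in zip(patches[i], reconstructions[i]))
    return total / len(masked)


# Toy example: three patches with two values each; patch 0 is visible.
patches = [[1.0, 2.0], [0.0, 0.0], [3.0, 1.0]]
recon = [[9.0, 9.0], [1.0, 0.0], [3.0, 3.0]]
loss = mae_loss(patches, recon, masked=[1, 2])   # (1 + 4) / 2 = 2.5
```

Implementations often also average over the pixels inside each patch, which only rescales the loss by a constant factor.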

4. Experiments

4.1 Setup

Dataset: ImageNet‑1K (1.28M train, 50K val)
Backbone: ViT‑Base/16 (\(L_E=12,d=768\)), ViT‑Large/16 (\(L_E=24\))
Decoder Depth: \(L_D=8\) layers
Training:
  - Base: 800 epochs, lr=1.5e‑4, warm‑up 40 epochs, cosine decay
  - Large: 1600 epochs, lr=1e‑4, warm‑up 40 epochs, cosine decay
  - Optimizer: AdamW, weight decay=0.05
Mask Ratio: 75%
Compute: 64×A100 GPUs
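The warm-up plus cosine-decay schedule listed above can be sketched as a per-epoch function. The defaults follow the Base setting in this post; the linear warm-up from zero, the `min_lr=0.0` floor, and the function name `learning_rate` are assumptions, since the post does not specify them.

```python
import math


def learning_rate(epoch, base_lr=1.5e-4, warmup_epochs=40, total_epochs=800, min_lr=0.0):
    """Linear warm-up for warmup_epochs, then half-cosine decay down to min_lr."""
    if epoch < warmup_epochs:
        return base_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + (base_lr - min_lr) * 0.5 * (1.0 + math.cos(math.pi * progress))


# Learning rate at a few points of the 800-epoch Base schedule.
schedule = {epoch: learning_rate(epoch) for epoch in (0, 20, 40, 420, 800)}
```

The rate peaks at `base_lr` when warm-up ends (epoch 40) and reaches `min_lr` at the final epoch.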

4.2 Main Results

Model	Top‑1 Acc.	Epochs	FLOPs/Image
ViT‑Base/16 (supervised)	77.9%	—	17.6 ×10⁹
MAE ViT‑Base/16	83.6%	800	1.2 ×10⁹
MAE ViT‑Large/16	85.9%	1600	3.6 ×10⁹

4.3 Ablation Studies

Mask Ratio: tested from 60% to 90%; 75% works best.
Decoder Depth: tested with 4, 8, and 12 layers; 8 layers works best.
Visible Sampling: random vs. block-wise; random sampling performs better.

5. Discussion

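During pre-training, the MAE encoder processes only the visible patches (25% of the image), which substantially reduces compute and memory cost. The decoder is discarded afterwards, and downstream tasks use the encoder alone on the full set of patches. With nothing more than an MSE reconstruction loss, MAE learns representations comparable to those of contrastive methods, and larger gains can be expected when it is applied to large-scale unlabeled data.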
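A back-of-the-envelope estimate shows where the savings come from when the encoder sees only the visible patches. The sketch counts multiply-accumulate operations in the Transformer blocks under simplifying assumptions: ViT-Base sizes (dim 768, depth 12, MLP ratio 4), one extra class token, and no patch embedding, normalization, or softmax cost. The function name `encoder_macs` is hypothetical, and measured figures can differ.

```python
def encoder_macs(num_tokens, dim=768, depth=12, mlp_ratio=4):
    """Rough multiply-accumulate count for the Transformer blocks of a ViT encoder."""
    attn_proj = 4 * num_tokens * dim * dim          # Q, K, V and output projections
    attn_mix = 2 * num_tokens * num_tokens * dim    # QK^T scores and weighted sum of V
    mlp = 2 * mlp_ratio * num_tokens * dim * dim    # two linear layers of the MLP
    return depth * (attn_proj + attn_mix + mlp)


full = encoder_macs(196 + 1)          # all patches of a 224 x 224 image plus class token
visible_only = encoder_macs(49 + 1)   # the 25% visible patches plus class token
ratio = full / visible_only
```

Under these assumptions the encoder cost drops by roughly a factor of four. The projection and MLP terms scale linearly with the token count, while the attention-mixing term scales quadratically.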

6. Conclusion

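This walkthrough covered the MAE paper following the structure of the paper itself. MAE sets a new reference point for self-supervised visual learning; promising follow-up directions include optimizing the mask ratio, multi-modal extensions, and applying the method to lightweight backbones.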

References

[1] Dosovitskiy, A. et al. (2021). An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale. ICLR.
[2] He, K. et al. (2022). Masked Autoencoders Are Scalable Vision Learners. CVPR.
[3] Chen, T. et al. (2020). A Simple Framework for Contrastive Learning of Visual Representations. ICML.
[4] He, K. et al. (2020). Momentum Contrast for Unsupervised Visual Representation Learning. CVPR.
[5] Bao, H. et al. (2022). BEiT: BERT Pre-Training of Image Transformers. ICLR.

