A Complete Guide to Optimizing GPU Memory and Compute Efficiency

illinaire 2025. 4. 18. 12:10

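Post summary: This post takes an in-depth look at the GPU memory management and compute optimization techniques needed to train and run inference on large deep learning models quickly and reliably. It covers mixed precision, gradient checkpointing, profiling, and DataLoader tuning, with practical tips and code examples.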

1. GPU Memory Modeling

Total GPU memory use is made up mainly of parameters, activations, and optimizer state (for example, momentum buffers). Roughly:


# Estimated memory use (bytes)
M ≈ P × D_param + B × A × D_act + O × D_opt

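\(P\): number of parameters; \(D_{param}\): size of the parameter dtype (FP32 = 4 bytes)
\(B\): batch size; \(A\): activation size per sample, in elements; \(D_{act}\): size of the activation dtype
\(O\): number of optimizer state elements (for example, AdamW keeps two per parameter, so \(O = 2P\)); \(D_{opt}\): size of the optimizer state dtype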

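Working through this estimate for a given combination of batch size and model size before launching a job helps avoid OOM (out-of-memory) errors. Treat the result as a lower bound: during training, gradients add roughly another \(P \times D_{param}\), and the CUDA context and allocator overhead come on top of that.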
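The estimate can be turned into a small calculator. The sketch below is illustrative only: the function name, its parameters, the `include_gradients` switch, and the example model size are assumptions rather than anything from the original post. The gradient term is added on top of the three terms in the formula because training also holds one gradient per parameter.

```python
def estimate_training_memory_bytes(
    num_params,
    batch_size,
    activation_elems_per_sample,
    param_bytes=4,            # D_param: FP32 parameters
    act_bytes=4,              # D_act: FP32 activations
    opt_states_per_param=2,   # AdamW keeps two state tensors per parameter
    opt_bytes=4,              # D_opt: FP32 optimizer state
    include_gradients=True,   # assumption: add one gradient per parameter
):
    """Rough estimate in bytes; ignores CUDA context and allocator overhead."""
    params = num_params * param_bytes
    activations = batch_size * activation_elems_per_sample * act_bytes
    optimizer = opt_states_per_param * num_params * opt_bytes
    gradients = num_params * param_bytes if include_gradients else 0
    return params + activations + optimizer + gradients


# Hypothetical model: 100M parameters, 2M activation elements per sample.
for batch_size in (8, 32, 128):
    total = estimate_training_memory_bytes(100_000_000, batch_size, 2_000_000)
    print(f"batch {batch_size:>3}: {total / 1024**3:.2f} GiB")
```

Only the activation term grows with batch size, so comparing a few batch sizes this way shows quickly whether activations or the fixed parameter and optimizer cost dominate.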

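2. Mixed Precision Training

2.1 Concept

Mixed precision runs much of the computation in FP16 instead of FP32. This cuts memory and bandwidth for the affected tensors by close to half, and it lets Tensor Cores speed up matrix multiplications and convolutions substantially.^[1]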

2.2 Implementation and Code

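import torch
# Newer PyTorch releases prefer torch.amp.autocast("cuda") and torch.amp.GradScaler("cuda").
from torch.cuda.amp import autocast, GradScaler

# MyModel, criterion, and loader are assumed to be defined elsewhere.
model = MyModel().cuda()
optimizer = torch.optim.Adam(model.parameters())  # hyperparameters are elided in the original
scaler = GradScaler()

for data, target in loader:
    data, target = data.cuda(), target.cuda()
    optimizer.zero_grad()
    with autocast():  # run the forward pass and the loss in mixed precision
        output = model(data)
        loss = criterion(output, target)
    scaler.scale(loss).backward()  # scale the loss so small FP16 gradients do not underflow
    scaler.step(optimizer)         # unscale gradients; skip the step if they contain inf/NaN
    scaler.update()                # adjust the scale factor for the next iteration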

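autocast(): automatically chooses the precision for each operation
GradScaler: manages the loss scale to prevent gradient underflow, and skips the optimizer step when the scaled gradients overflow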
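To see why loss scaling matters, the sketch below rounds a very small gradient value to FP16 with and without scaling. It uses only the Python standard library (`struct` with the half-precision `"e"` format) rather than PyTorch, and the helper `to_fp16` and the example value `1e-8` are assumptions chosen for illustration. The scale of 65536 matches the default initial scale of `GradScaler`.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


tiny_gradient = 1e-8   # smaller than anything FP16 can represent
loss_scale = 65536.0   # 2**16

unscaled = to_fp16(tiny_gradient)             # underflows to zero in FP16
scaled = to_fp16(tiny_gradient * loss_scale)  # large enough to survive in FP16
recovered = scaled / loss_scale               # unscaling happens in higher precision

print(f"FP16 without scaling: {unscaled}")
print(f"FP16 with scaling, then unscaled: {recovered:.3e}")
```

Scaling the loss multiplies every gradient in the backward pass by the same factor, which is why a single scalar is enough to move small gradients into the representable range.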

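3. Gradient Checkpointing

3.1 Concept

Gradient checkpointing saves memory by not storing some intermediate activations and recomputing them during the backward pass. The recomputation costs extra compute (roughly one additional forward pass through the checkpointed segments), but it makes training larger models possible.^[2]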

3.2 Code Example

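import torch.utils.checkpoint as cp
import torch.nn as nn

class BigModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Sequential(...)  # layers are elided in the original
        self.layer2 = nn.Sequential(...)

    def forward(self, x):
        # Activations inside each checkpointed block are recomputed during backward.
        # use_reentrant=False is the variant recommended by recent PyTorch releases;
        # it also works when the input x itself does not require gradients.
        x = cp.checkpoint(self.layer1, x, use_reentrant=False)
        x = cp.checkpoint(self.layer2, x, use_reentrant=False)
        return x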
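Checkpointing changes when activations are computed, not what the gradients are. The following sketch checks this on the CPU with a small block; the layer sizes, the seed, and the helper `parameter_grads` are assumptions for illustration and not part of the original model.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)
block = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 16))
x = torch.randn(4, 16)


def parameter_grads(use_checkpoint):
    """Run forward and backward once and return a copy of every parameter gradient."""
    block.zero_grad()
    if use_checkpoint:
        # Intermediate activations are dropped and recomputed during backward.
        out = checkpoint(block, x, use_reentrant=False)
    else:
        out = block(x)
    out.sum().backward()
    return [p.grad.clone() for p in block.parameters()]


plain = parameter_grads(use_checkpoint=False)
checkpointed = parameter_grads(use_checkpoint=True)
grads_match = all(torch.allclose(a, b) for a, b in zip(plain, checkpointed))
print(f"Gradients match: {grads_match}")
```

On a model this small the memory saving is negligible; the benefit shows up when the checkpointed blocks hold large activations, at the cost of the extra forward computation.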

4. DataLoader Optimization

Easing bottlenecks in data reading and preprocessing can shorten total training time considerably.

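Option	Description
`num_workers`	Number of parallel worker processes (a common starting point is the number of CPU cores available per GPU; tune from there, since going far beyond the core count oversubscribes the CPU)
`pin_memory=True`	Uses page-locked host memory to speed up transfers to the GPU
`persistent_workers=True`	Keeps workers alive between epochs to reduce startup overhead (PyTorch ≥1.7, requires `num_workers` > 0)
`prefetch_factor`	Number of batches each worker loads in advance (default 2, requires `num_workers` > 0)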

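from torch.utils.data import DataLoader

# dataset is assumed to be defined elsewhere.
loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=8,            # parallel worker processes
    pin_memory=True,          # page-locked host memory for faster GPU transfers
    persistent_workers=True,  # keep workers alive between epochs
    prefetch_factor=4         # batches loaded in advance per worker
)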
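The right `num_workers` value depends on the machine, so it can help to derive a starting point from the CPU cores the process may actually use. The helper below is a hypothetical sketch: the name `suggest_num_workers`, the per-GPU split, and the cap of 16 are assumptions, and the result is a first guess to tune rather than a rule.

```python
import os


def suggest_num_workers(gpus_on_node=1, cap=16):
    """Return a starting value for DataLoader(num_workers=...)."""
    try:
        # Respects CPU affinity limits (for example in containers); Linux only.
        usable_cores = len(os.sched_getaffinity(0))
    except AttributeError:
        usable_cores = os.cpu_count() or 1
    # Split the cores across the training processes, one per GPU.
    per_gpu = usable_cores // max(1, gpus_on_node)
    return max(1, min(cap, per_gpu))


print(f"Suggested num_workers: {suggest_num_workers()}")
```

From this starting point, watching GPU utilization while varying `num_workers` shows whether data loading is still the bottleneck.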

5. Profiling & Monitoring

5.1 PyTorch Profiler

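import torch.profiler as profiler

# model and loader are assumed to be defined elsewhere.
with profiler.profile(
    # Skip 1 step, warm up for 1 step, then record 3 steps.
    schedule=profiler.schedule(wait=1, warmup=1, active=3),
    on_trace_ready=profiler.tensorboard_trace_handler('log_dir'),
    record_shapes=True,
    profile_memory=True  # needed for the memory view in the trace
) as prof:
    for step, (data, _) in enumerate(loader):
        if step >= 5: break
        model(data.cuda())
        prof.step()  # tell the profiler that one step has finished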

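The resulting trace can be opened in TensorBoard to visualize time and memory use per operation. Memory figures are recorded only when the profiler is created with profile_memory=True.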
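For a quick look without TensorBoard, the profiler can also print a per-operator summary table. The sketch below runs on the CPU so that it works without a GPU; the small model and input shape are assumptions for illustration.

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
inputs = torch.randn(32, 64)

# Profile a single forward pass on the CPU.
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with torch.no_grad():
        model(inputs)

# Aggregate events by operator name and show the five most expensive ones.
summary = prof.key_averages().table(sort_by="cpu_time_total", row_limit=5)
print(summary)
```

On a GPU, adding `ProfilerActivity.CUDA` to `activities` includes kernel times in the same table.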

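5.2 Real-Time GPU Monitoring

import torch

# Reset the peak counter before the section you want to measure
torch.cuda.reset_peak_memory_stats()
# ... run forward/backward here ...
peak = torch.cuda.max_memory_allocated() / (1024**2)  # bytes -> MiB
print(f"Peak GPU Memory: {peak:.1f} MiB")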

Figure: GPU profiling example showing per-operation memory usage in the TensorBoard Profiler (source: the author's own experiment)

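6. Combined Strategies and Considerations

Mixed precision plus checkpointing: combining the two gives the largest memory savings
Gradient accumulation: accumulate gradients from small batches over several steps to emulate a larger batch
Memory fragmentation: torch.cuda.empty_cache() returns cached but unused blocks to the driver, which can help after a one-off large allocation; it does not free live tensors, and calling it every iteration slows training
Learning rate scheduler: LR warm-up is recommended when combining large batches with mixed precision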
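Gradient accumulation from the list above can be checked on a toy problem. The sketch below compares the gradient from one batch of 16 samples with the gradient accumulated over four micro-batches of 4 samples each; the model, data shapes, and seed are assumptions for illustration, and the code runs on the CPU.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(8, 1)
criterion = nn.MSELoss()  # averages the loss over the batch
data = torch.randn(16, 8)
target = torch.randn(16, 1)

# Reference: gradient from a single batch of 16 samples.
model.zero_grad()
criterion(model(data), target).backward()
full_batch_grad = model.weight.grad.clone()

# Accumulate over 4 micro-batches of 4 samples each.
accumulation_steps = 4
model.zero_grad()
micro_batches = zip(data.chunk(accumulation_steps), target.chunk(accumulation_steps))
for micro_data, micro_target in micro_batches:
    loss = criterion(model(micro_data), micro_target)
    # Divide so that the summed micro-batch gradients equal the full-batch mean.
    (loss / accumulation_steps).backward()
# In a training loop, optimizer.step() and zero_grad() would run here, once per accumulated batch.

grads_match = torch.allclose(full_batch_grad, model.weight.grad, atol=1e-6)
print(f"Accumulated gradient matches full batch: {grads_match}")
```

Only one micro-batch of activations is alive at a time, so peak activation memory follows the micro-batch size while the optimizer sees the larger effective batch. Layers that depend on batch statistics, such as batch normalization, still see the smaller micro-batch.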

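7. Conclusion and Next Steps

Analyze bottlenecks with the profiler and optimize based on what it shows
Apply the techniques to transfer learning and large-scale models to verify that they scale
Run comparison experiments across backends (CUDA and ROCm)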

References

[1] Micikevicius, P. et al. (2018). Mixed Precision Training. ICLR.
[2] Chen, T. et al. (2016). Training Deep Nets with Sublinear Memory Cost. arXiv:1604.06174.
[3] Paszke, A. et al. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. NeurIPS.

License: Attribution, NonCommercial, NoDerivatives
