Translation (57): Avoiding 'CUDA out of memory' in PyTorch

Stack Overflow Hot Questions Index

If you notice any issues with the translation, feel free to point them out in the comments. Thanks.

How to avoid 'CUDA out of memory' in PyTorch

  • voilalex asked:

    • I think it's a pretty common message for PyTorch users with low GPU memory:

    • RuntimeError: CUDA out of memory. Tried to allocate ? MiB (GPU ?; ? GiB total capacity; ? GiB already allocated; ? MiB free; ? cached)
    • I tried to process an image by loading each layer to the GPU and then moving it back:

    • for m in self.children():
          m.cuda()                    # move this layer to the GPU
          x = m(x)                    # run the layer on the input
          m.cpu()                     # move the layer back to the CPU
          torch.cuda.empty_cache()    # release cached GPU memory
    • But it doesn't seem to be very effective. I'm wondering if there are any tips and tricks to train large deep learning models while using little GPU memory.

  • Answers:

    • SHAGUN SHARMA - vote: 57

    • Although

    • import torch
      torch.cuda.empty_cache()
    • provides a good alternative for clearing the occupied CUDA memory, and we can also manually clear variables that are no longer in use by using

    • import gc
      del variables
      gc.collect()
    • even after these commands the error might appear again, because PyTorch doesn't actually free the memory; it only clears the references to the memory occupied by the variables. So reducing the batch_size after restarting the kernel and finding the optimum batch_size is the best possible option (though sometimes not a very feasible one).

    • Another way to get a deeper insight into the allocation of GPU memory is:

    • torch.cuda.memory_summary(device=None, abbreviated=False)
    • Both arguments are optional. This gives a readable summary of memory allocation and lets you figure out why CUDA is running out of memory, so you can adjust the code and restart the kernel to avoid the error happening again.
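
    • For a quicker numeric check than the full summary, recent PyTorch versions also expose per-device counters (a minimal sketch, assuming a CUDA-enabled build):

    • import torch

      if torch.cuda.is_available():
          # bytes currently held by live tensors on the default device
          print(torch.cuda.memory_allocated())
          # bytes reserved by PyTorch's caching allocator (>= allocated)
          print(torch.cuda.memory_reserved())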

    • Passing the data iteratively might help, but changing the size of your network's layers or breaking them down can also prove effective (sometimes the model itself occupies significant memory, for example in transfer learning); one option is sketched below.
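
    • One common way to cut a transfer-learning model's footprint (not from the original answer) is to freeze the pretrained backbone so no gradients or optimizer state are kept for it; a minimal sketch, assuming a torchvision ResNet backbone:

    • import torch
      import torchvision

      model = torchvision.models.resnet18(pretrained=True)
      for p in model.parameters():
          p.requires_grad_(False)  # frozen: no gradient buffers allocated
      # replace the head with a new trainable layer (10 classes, illustrative)
      model.fc = torch.nn.Linear(model.fc.in_features, 10)
      # only hand the trainable parameters to the optimizer
      optimizer = torch.optim.SGD(
          (p for p in model.parameters() if p.requires_grad), lr=1e-3)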

    • Rahul - vote: 32

    • Just reduce the batch size and it will work. While I was training, it gave the following error:

    CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 10.76 GiB total capacity; 4.29 GiB already allocated; 10.12 MiB free; 4.46 GiB reserved in total by PyTorch)

    • I was using a batch size of 32, so I just changed it to 15 and it worked for me.
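
    • For reference, the batch size is just the batch_size argument of the DataLoader; a minimal sketch with an illustrative in-memory dataset:

    • import torch
      from torch.utils.data import DataLoader, TensorDataset

      # illustrative random data: 1000 RGB images of 224x224, 10 classes
      dataset = TensorDataset(torch.randn(1000, 3, 224, 224),
                              torch.randint(0, 10, (1000,)))
      # dropping batch_size (e.g. from 32 to 15) shrinks per-step activation memory
      train_loader = DataLoader(dataset, batch_size=15, shuffle=True)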

    • Nicolas Gervais - vote: 21

    • Send the batches to CUDA iteratively, and use small batch sizes. Don't send all your data to CUDA at once at the beginning; rather, do it as follows:

    • for e in range(epochs):
        for images, labels in train_loader:   
            if torch.cuda.is_available():
                images, labels = images.cuda(), labels.cuda()   
            # blablabla  
    • You can also use dtypes that use less memory, for instance torch.float16 or torch.half; see the mixed-precision sketch below.
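
    • The lower-precision idea is packaged more safely by automatic mixed precision (torch.cuda.amp, PyTorch >= 1.6), which runs eligible ops in float16 while keeping weights in float32; a minimal sketch, assuming model, optimizer, and train_loader are defined as above:

    • import torch

      scaler = torch.cuda.amp.GradScaler()  # scales the loss to avoid float16 underflow

      for images, labels in train_loader:
          images, labels = images.cuda(), labels.cuda()
          optimizer.zero_grad()
          with torch.cuda.amp.autocast():   # forward pass in mixed precision
              loss = torch.nn.functional.cross_entropy(model(images), labels)
          scaler.scale(loss).backward()     # backward on the scaled loss
          scaler.step(optimizer)            # unscales gradients, then steps
          scaler.update()                   # adjusts the scale factor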



Copyright notice:
Author: MWHLS
Link: http://panwj.top/3789.html
Source: 无镣之涯
The copyright belongs to the author. Please do not repost without permission.
