Translation (58) - How the number of workers in the PyTorch DataLoader works

MWHLS • 2022/04/30 7:13 pm • Python, PyTorch, Programming Languages

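If you spot any translation problems, please point them out in the comments. Thank you.
This question really does have only one answer.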

How does the number of workers parameter in PyTorch dataloader actually work?

floyd asked:
1. If num_workers is 2, does that mean it will put 2 batches in RAM and send 1 of them to the GPU, or does it put 3 batches in RAM and then send 1 of them to the GPU?
2. What actually happens when the number of workers is higher than the number of CPU cores? I tried it and it worked fine, but how does it work? (I thought the maximum number of workers I could choose was the number of cores.)
3. If I set num_workers to 3 and during training there are no batches in memory for the GPU, does the main process wait for its workers to read the batches, or does it read a single batch itself (without waiting for the workers)?
Answers:
- Shihab Shahriar Khan - vote: 71
  1. When num_workers>0, only these workers retrieve data; the main process does not. So when num_workers=2, at most 2 workers are putting data into RAM simultaneously, not 3.
  2. A CPU can usually run on the order of 100 processes without trouble, and these worker processes aren't special in any way, so having more workers than CPU cores is fine. But is it efficient? That depends on how busy your CPU cores are with other tasks, the speed of the CPU, the speed of your hard disk, etc. In short, it's complicated, so setting the number of workers to the number of cores is a good rule of thumb, nothing more.
  3. Nope. Remember that DataLoader doesn't just randomly return whatever is available in RAM right now; it uses batch_sampler to decide which batch to return next. Each batch is assigned to a worker, and the main process waits until the desired batch has been retrieved by its assigned worker.
  - Lastly, to clarify: it isn't DataLoader's job to send anything directly to the GPU; you explicitly call cuda() for that.
  - EDIT: Don't call cuda() inside the Dataset's __getitem__() method; see @psarka's comment for the reasoning.
    - Translator's note: psarka's comment:
      Just a remark on the last sentence: it is probably not a good idea to call .cuda() in the Dataset object, as it will have to move each sample (rather than the batch) to the GPU separately, incurring a lot of overhead.
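
Point 1 and the closing notes of the answer can be illustrated with a minimal sketch. It assumes a hypothetical toy dataset (`RangeDataset`) and a hypothetical helper (`collect_batches`), neither of which comes from the original answer. Each sample records which worker loaded it, and the batch is moved to the device once per batch in the loop rather than inside `__getitem__()`.

```python
import torch
from torch.utils.data import DataLoader, Dataset, get_worker_info


class RangeDataset(Dataset):
    """Hypothetical toy dataset: sample i is simply the number i."""

    def __init__(self, size):
        self.size = size

    def __len__(self):
        return self.size

    def __getitem__(self, index):
        info = get_worker_info()  # None when called from the main process
        worker_id = info.id if info is not None else -1
        # Return CPU data only; no .cuda() call here.
        return torch.tensor(index), worker_id


def collect_batches(num_workers):
    """Iterate once and report, per batch, its values and who loaded it."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    loader = DataLoader(RangeDataset(8), batch_size=4, num_workers=num_workers)
    seen = []
    for values, worker_ids in loader:
        values = values.to(device)  # one transfer per batch, done by the caller
        seen.append((values.tolist(), sorted(set(worker_ids.tolist()))))
    return seen


if __name__ == "__main__":
    for batch_values, workers in collect_batches(num_workers=2):
        print(batch_values, "loaded by worker(s)", workers)
```

With `num_workers=2`, no sample should report the id `-1`, since the main process only receives batches. With `num_workers=0`, every sample reports `-1`, because the main process does the loading itself.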
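
The rule of thumb in point 2 (start from the number of cores) might be written as a small helper. This is only a sketch: `suggest_num_workers` and its `cap` parameter are hypothetical names, and the returned value is a starting point to benchmark from, not an optimum.

```python
import os


def suggest_num_workers(cap=None):
    """Return a starting value for num_workers based on usable CPU cores."""
    if hasattr(os, "sched_getaffinity"):
        # Linux: count only the cores this process is allowed to run on.
        cores = len(os.sched_getaffinity(0))
    else:
        # Other platforms: fall back to the total core count.
        cores = os.cpu_count() or 1
    if cap is None:
        return cores
    return max(0, min(cores, cap))


if __name__ == "__main__":
    print("suggested num_workers:", suggest_num_workers(cap=8))
```

Because the best value also depends on disk speed and on what else the CPU is doing, it is worth timing a few values around this suggestion on the actual workload.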
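
The waiting behaviour described in point 3 can be simulated without PyTorch. The sketch below is not DataLoader's actual implementation. It uses threads from the standard library as stand-ins for worker processes, and `load_batch` and `iterate_in_order` are hypothetical names. It shows only the ordering idea: batches are handed out in a fixed order, and the consumer blocks on the specific batch it needs next even if later batches are already finished.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def load_batch(batch_index, delay):
    """Stand-in for a worker reading one batch from disk."""
    time.sleep(delay)
    return batch_index


def iterate_in_order(delays, num_workers=3):
    """Return (order in which workers finished, order in which batches were delivered)."""
    finished = []
    delivered = []

    def tracked_load(batch_index, delay):
        result = load_batch(batch_index, delay)
        finished.append(result)  # records the real completion order
        return result

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Batches are assigned up front, in the order the sampler would produce them.
        futures = [pool.submit(tracked_load, i, d) for i, d in enumerate(delays)]
        for future in futures:
            # Blocks until this specific batch is ready, not just any batch.
            delivered.append(future.result())
    return finished, delivered


if __name__ == "__main__":
    finished, delivered = iterate_in_order([0.3, 0.05, 0.05])
    print("finished: ", finished)
    print("delivered:", delivered)
```

Here batch 0 is deliberately the slowest, so it finishes last, yet it is still delivered first. The consumer waits for it instead of taking whichever batch happens to be ready.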