Getting an OOM error? Here's how to debug it

An OOM, or out-of-memory, error occurs when the GPU VRAM or the CPU RAM runs out of memory.

The error is usually caused by the model or the batch size being too large for the available memory.

The following checks help narrow down the source of the error:

  • Check whether the model is too big to fit on the given hardware. You can estimate this from the number of parameters and the size and type of its tensors (float32/float64, etc.); see the sketch after this list.
  • Check if multiple training sessions are running simultaneously
  • Check whether another Jupyter notebook is also using up RAM/VRAM
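
As a quick back-of-the-envelope check, you can estimate a model's memory footprint from its parameter count and dtype. The sketch below uses made-up placeholder numbers (NUM_PARAMS, DTYPE_BYTES); substitute the figures for your own model.

```python
# Rough estimate of model memory footprint from parameter count and dtype.
# NUM_PARAMS and DTYPE_BYTES are placeholders -- replace with your model's values.
NUM_PARAMS = 7_000_000_000   # hypothetical 7B-parameter model
DTYPE_BYTES = 4              # float32 = 4 bytes, float16/bfloat16 = 2, float64 = 8

weights_gib = NUM_PARAMS * DTYPE_BYTES / 1024**3
print(f"Weights alone: ~{weights_gib:.1f} GiB")

# During training, gradients and optimizer state add a large multiple on top of
# the weights (e.g. Adam keeps extra tensors per parameter), and activations
# grow with batch size -- so the real requirement is well above this number.
```

If the weights alone already approach your VRAM, the model will not fit for training on that hardware without changes such as a smaller batch size or lower-precision weights.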

This link assists with OOM errors encountered when using TensorFlow.

To check the above, you can open a new terminal window from the Jupyter or VS Code menu and monitor the hardware by running the following commands:

  • `nvidia-smi` : to see which processes are running on the GPU and how much memory each has allocated.
  • `htop` : to see which processes are using RAM and how much RAM is left.
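
If you would rather log GPU memory from inside a notebook than watch a separate terminal, a small sketch like the one below queries the same numbers programmatically. It assumes an NVIDIA GPU with `nvidia-smi` available on the PATH.

```python
# Query GPU memory usage programmatically (the same data nvidia-smi prints),
# useful for logging from inside a notebook cell.
import subprocess

result = subprocess.run(
    ["nvidia-smi", "--query-gpu=index,memory.used,memory.total",
     "--format=csv,noheader,nounits"],
    capture_output=True, text=True, check=True,
)
for line in result.stdout.strip().splitlines():
    idx, used, total = (field.strip() for field in line.split(","))
    print(f"GPU {idx}: {used} MiB used / {total} MiB total")
```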