Every compute node on the Mila cluster runs a monitoring daemon that lets you
check the resource usage of your model and identify bottlenecks.
You can access the monitoring web page by typing in your browser:
For example, if I have a job running on eos1, I can type
the page below should appear.
You should focus your attention on the metrics below:
iowait (pink line): High values mean your model spends a lot of time waiting on IO (disk or network).
Make sure you allocate only enough to make your code run and no more; otherwise you are wasting resources.
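To cross-check the iowait figure from a shell on the node itself, the sketch below samples the kernel's CPU counters twice and reports the share of time spent in iowait. It assumes a Linux node exposing `/proc/stat`; the function names (`parse_cpu_line`, `iowait_percent`, `sample_iowait`) are illustrative and not part of the monitoring daemon.

```python
import time


def parse_cpu_line(stat_text):
    """Return the aggregate CPU counters from /proc/stat content as ints."""
    for line in stat_text.splitlines():
        # The aggregate line starts with "cpu " (per-core lines are cpu0, cpu1, ...).
        if line.startswith("cpu "):
            return [int(value) for value in line.split()[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two samples."""
    deltas = [b - a for a, b in zip(before, after)]
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice.
    # guest and guest_nice are already counted in user and nice, so only
    # the first eight fields are summed for the total.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


def sample_iowait(interval=1.0, path="/proc/stat"):
    """Sample /proc/stat twice, `interval` seconds apart (assumes Linux)."""
    with open(path) as f:
        before = parse_cpu_line(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_cpu_line(f.read())
    return iowait_percent(before, after)
```

Calling `sample_iowait()` while your job is running gives a node-wide figure, so on a shared node it also includes IO waits caused by other jobs.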
Usage of each GPU
- You should make sure you use the GPU to its fullest
  - Select the biggest batch size if possible
  - Spawn multiple experiments
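GPU usage can also be read from the command line, which is handy when deciding whether a larger batch size or an extra experiment would still fit. The sketch below is one possible approach, assuming `nvidia-smi` is available on the node; the helper names (`parse_gpu_csv`, `query_gpus`) are hypothetical.

```python
import subprocess

# Fields requested from nvidia-smi, in the order they are parsed below.
QUERY = "index,utilization.gpu,memory.used,memory.total"


def _to_int(field):
    """Convert a CSV field to int; return None for values such as '[N/A]'."""
    try:
        return int(field)
    except ValueError:
        return None


def parse_gpu_csv(text):
    """Parse 'csv,noheader,nounits' output into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        index, util, used, total = [_to_int(f.strip()) for f in line.split(",")]
        rows.append({
            "index": index,
            "util_percent": util,    # GPU utilization in percent
            "mem_used_mib": used,    # memory in MiB
            "mem_total_mib": total,
        })
    return rows


def query_gpus():
    """Run nvidia-smi and return the parsed per-GPU usage (assumes it is on PATH)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)
```

Low utilization together with plenty of free memory suggests there is room for a larger batch size or another experiment on the same GPU.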
In some cases the machine might seem slow; it may then be useful to check whether other people are using the machine as well.
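One way to see who else is active on a node is to total CPU usage per user from `ps`. This is a rough sketch, assuming a Linux node with the procps version of `ps`; the function names are illustrative only.

```python
import subprocess
from collections import defaultdict


def cpu_by_user(ps_output):
    """Sum the %CPU column per user from 'user pcpu' lines, highest first."""
    totals = defaultdict(float)
    for line in ps_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        user, pcpu = parts
        totals[user] += float(pcpu)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def current_cpu_by_user():
    """Run ps and aggregate CPU usage per user (assumes procps ps on Linux)."""
    result = subprocess.run(
        ["ps", "-eo", "user:32,pcpu", "--no-headers"],
        capture_output=True, text=True, check=True,
    )
    return cpu_by_user(result.stdout)
```

Note that the `pcpu` value reported by `ps` is averaged over each process's lifetime, so it indicates who has been using the machine rather than the instantaneous load.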