Every compute node on the Mila cluster runs a monitoring daemon that lets you check the resource usage of your model and identify bottlenecks. You can access the monitoring web page by typing in your browser: <node>

For example, if I have a job running on eos1, I can type the address for eos1 and the page below should appear.


Notable Sections

You should focus your attention on the metrics below:

  • CPU
    • iowait (pink line): High values mean your model spends a lot of time waiting on IO (disk or network)

  • RAM
    • Make sure you allocate only enough to make your code run and no more; otherwise you are wasting resources.

  • NV
    • Usage of each GPU

    • Make sure you use each GPU to its fullest:
      • Use the largest batch size possible

      • Spawn multiple experiments

  • Users
    • If the machine seems slow, it may be useful to check whether other people are using it as well
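
If you want to cross-check the iowait figure from inside your own job, the sketch below computes it from two samples of /proc/stat. It assumes a Linux node; the function names (parse_cpu_line, iowait_percent, sample_iowait) are illustrative and not part of any cluster tooling, and the number it reports may not match the monitoring page exactly because the sampling window differs.

```python
import time


def parse_cpu_line(stat_text):
    """Return the aggregate CPU counters from the text of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        # The aggregate line is labelled "cpu"; per-core lines are "cpu0", "cpu1", ...
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two samples."""
    deltas = [b - a for a, b in zip(before, after)]
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice.
    # guest and guest_nice are already counted in user and nice, so sum only the first 8.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


def sample_iowait(interval=1.0, path="/proc/stat"):
    """Read /proc/stat twice, `interval` seconds apart (Linux only; an assumption)."""
    with open(path) as f:
        first = parse_cpu_line(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = parse_cpu_line(f.read())
    return iowait_percent(first, second)
```

Calling sample_iowait() while your training loop is running returns a percentage; a value that stays high suggests the job is bound by data loading rather than by compute.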