What is a Cluster?

A computer cluster is a set of loosely or tightly connected computers that work together so that, in many respects, they can be viewed as a single system.

Wikipedia

To deliver high computational performance, a cluster combines hundreds to thousands of computers, called nodes, all connected by high-performance interconnects. Most nodes are designed for high-performance computation, but a cluster also needs specialized nodes to support highly parallel file systems, databases, job scheduling, and login, as pictured in the image below.

_images/cluster_overview2.png

All clusters run a GNU/Linux distribution, so a minimal knowledge of GNU/Linux and bash is required to use them (see the tutorial).

Login Node

To run compute-intensive processes, you must connect to a login node and submit a job through the job scheduler (Slurm). Login nodes can only handle very short, light processes; anything heavier can make the cluster inaccessible. In other words, do not run long or compute-intensive processes on a login node, because doing so affects all other users.
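As a minimal sketch of that workflow, the snippet below writes a small Slurm batch script and submits it with sbatch from a login node. The script name, the resource values, and the echo payload are illustrative assumptions rather than site defaults, and the submission step is skipped when sbatch is not on the PATH.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical file name for the job script.
job_script="hello_job.sh"

# Write the batch script; the #SBATCH values below are example choices.
cat > "$job_script" <<'EOF'
#!/usr/bin/env bash
#SBATCH --job-name=hello
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G
#SBATCH --time=00:05:00

# Everything below runs on a compute node, not on the login node.
echo "Running on $(hostname)"
EOF

# Submit only if Slurm is installed (i.e. we are on a cluster).
if command -v sbatch >/dev/null 2>&1; then
    sbatch "$job_script"
else
    echo "sbatch not found; not on a cluster, skipping submission"
fi
```

The login node only does the lightweight work of writing and submitting the script; the scheduler decides when and on which compute node the body of the script runs.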

File system

Clusters provide different types of file systems to support different use cases. Globally available file systems ($HOME or $SCRATCH, see Mila or CC for more info) make the software and data required by jobs accessible from any node. Backed-up file systems ($PROJECT) provide more space and can handle large files, but cannot sustain highly parallel access. Each compute node has a local file system ($SLURM_TMPDIR) that is more efficient but is erased at the end of the job.
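One pattern that follows from these trade-offs is to copy inputs to the node-local disk at the start of a job, work there, and copy results back before the job ends. The sketch below is only an illustration: the my_dataset and results directories under $SCRATCH are hypothetical names, a dummy input file is created if none exists, and temporary directories stand in for $SCRATCH and $SLURM_TMPDIR when those variables are not set, so the snippet can be tried outside a job.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Fall back to temporary directories when run outside a Slurm job.
scratch="${SCRATCH:-$(mktemp -d)}"
local_dir="${SLURM_TMPDIR:-$(mktemp -d)}"

# Hypothetical dataset location; create a dummy input if it is missing.
dataset="$scratch/my_dataset"
mkdir -p "$dataset"
[ -e "$dataset/sample.txt" ] || echo "example input" > "$dataset/sample.txt"

# 1. Stage in: copy inputs from the shared file system to the local disk.
cp -r "$dataset" "$local_dir/"

# 2. Compute on the local copy (placeholder for the real workload).
mkdir -p "$local_dir/results"
wc -c < "$local_dir/my_dataset/sample.txt" > "$local_dir/results/size.txt"

# 3. Stage out: copy results back before the local disk is erased.
mkdir -p "$scratch/results"
cp -r "$local_dir/results/." "$scratch/results/"
```

Keeping the heavy reads and writes on the local disk avoids stressing the shared file systems, at the cost of the explicit copy steps at the start and end of the job.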

Resources Available at Mila

This table lists the computational resources Mila has access to.

Cluster    CPUs    GPUs
Mila       -       248
Beluga     34k     688 V100
Cedar      27k     584 P100
Graham     36k     320 P100
Helios     -       216 K80
Niagara    60k     -
Total      134k    2040

Mila Cluster

for regular development and a small number of jobs (< 5)

Compute Canada Clusters

for many jobs, multi-node and/or multi-GPU jobs, and jobs with a long walltime

Cloud clusters

burst capacity for jobs close to a deadline