Cluster Nodes

Complete List of Nodes

| Name | Primary GPU | # | Secondary GPU | # | CPUs | Sockets | Cores/Socket | Threads/Core | Memory (GB) | TmpDisk (TB) | Arch | Features (GPU Arch and Memory) |
|------|-------------|---|---------------|---|------|---------|--------------|--------------|-------------|--------------|------|--------------------------------|
| **BART** | | | | | | | | | | | | |
| bart[12-13] | k6000 | 2 | | | 8 | 1 | 4 | 2 | 64 | 3.6 | x86_64 | kepler,12gb |
| bart14 | k6000 | 1 | | | 8 | 1 | 4 | 2 | 64 | 3.6 | x86_64 | kepler,12gb |
| **EOS** | | | | | | | | | | | | |
| eos[1-6] | titanx | 1 | | | 8 | 1 | 4 | 2 | 15 | 3.6 | x86_64 | maxwell,12gb |
| eos[11-15] | titanx | 1 | | | 8 | 1 | 4 | 2 | 32 | 3.6 | x86_64 | maxwell,12gb |
| eos[16-19] | titanx | 1 | | | 8 | 1 | 4 | 2 | 64 | 3.6 | x86_64 | maxwell,12gb |
| eos20 | k6000 | 1 | | | 8 | 1 | 4 | 2 | 64 | 3.6 | x86_64 | kepler,12gb |
| eos21 | k6000 | 1 | | | 8 | 1 | 4 | 2 | 64 | 3.6 | x86_64 | kepler,12gb |
| eos22 | titanx | 1 | | | 8 | 1 | 4 | 2 | 48 | 3.6 | x86_64 | maxwell,12gb |
| **KEPLER** | | | | | | | | | | | | |
| kepler[2-3] | k80 | 8 | | | 16 | 2 | 4 | 2 | 256 | 3.6 | x86_64 | tesla,12gb |
| kepler4 | m40 | 4 | | | 16 | 2 | 4 | 2 | 256 | 3.6 | x86_64 | maxwell,24gb |
| kepler5 | v100 | 2 | m40 | 1 | 16 | 2 | 4 | 2 | 256 | 3.6 | x86_64 | volta,12gb |
| **LETO** | | | | | | | | | | | | |
| leto01 | titanv | 1 | | | 12 | 1 | 6 | 2 | 64 | 3.6 | x86_64 | volta,12gb |
| leto[02,07] | titanx | 2 | | | 12 | 1 | 6 | 2 | 32 | 3.6 | x86_64 | maxwell,12gb |
| leto03 | rtx2080 | 1 | | | 12 | 1 | 6 | 2 | 32 | 3.6 | x86_64 | turing,12gb |
| leto[04,08] | titanx | 2 | display | 1 | 12 | 1 | 6 | 2 | 32 | 3.6 | x86_64 | maxwell,12gb |
| leto[11-12] | titanx | 2 | display | 1 | 12 | 1 | 6 | 2 | 48 | 3.6 | x86_64 | maxwell,12gb |
| leto[16-17] | P6000 | 2 | | | 12 | 1 | 6 | 2 | 64 | 3.6 | x86_64 | pascal,24gb |
| leto[21-24,39-40] | titanxp | 2 | | | 12 | 1 | 6 | 2 | 64 | 3.6 | x86_64 | pascal,12gb |
| leto20 | titanxp | 2 | | | 12 | 1 | 6 | 2 | 64 | 3.6 | x86_64 | pascal,12gb |
| leto38 | titanxp | 2 | | | 12 | 1 | 6 | 2 | 64 | 3.6 | x86_64 | pascal,12gb |
| leto25 | gtx1080 | 2 | | | 8 | 1 | 4 | 2 | 64 | 3.6 | x86_64 | pascal,8gb |
| leto[50-52] | titanx | 3 | display | 1 | 12 | 1 | 6 | 2 | 64 | 3.6 | x86_64 | maxwell,12gb |
| **MILA** | | | | | | | | | | | | |
| mila01 | v100 | 8 | | | 80 | 2 | 20 | 2 | 512 | 7 | x86_64 | tesla,16gb |
| mila02 | v100 | 8 | | | 80 | 2 | 20 | 2 | 512 | 7 | x86_64 | tesla,32gb |
| mila03 | v100 | 8 | | | 80 | 2 | 20 | 2 | 512 | 7 | x86_64 | tesla,32gb |
| **POWER9** | | | | | | | | | | | | |
| power9[1-2] | v100 | 4 | | | 128 | 2 | 16 | 4 | 586 | 0.88 | power9 | tesla,nvlink,16gb |
| **TITAN RTX** | | | | | | | | | | | | |
| rtx[6,9] | titanrtx | 2 | | | 20 | 1 | 10 | 2 | 128 | 3.6 | x86_64 | turing,24gb |
| rtx[1-5,7-8] | titanrtx | 2 | | | 20 | 1 | 10 | 2 | 128 | 0.93 | x86_64 | turing,24gb |
| **NEW** APOLLO | | | | | | | | | | | | |
| apollov[01-05] | v100 | 8 | | | 80 | 2 | 20 | 2 | 380 | 3.6 | x86_64 | tesla,nvlink,32gb |
| apollor[06-16] | rtx8000 | 8 | | | 80 | 2 | 20 | 2 | 380 | 3.6 | x86_64 | turing,48gb |
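The Features column lists the tags attached to each node. Assuming these tags are exposed to the scheduler as Slurm node features and GPU GRES types (verify the exact names with sinfo -o "%N %G %f"), they can be used to target a particular kind of GPU. A minimal sketch, where job.sh is a placeholder batch script:

    # Request any GPU on a node advertising the "turing" feature
    # (feature name assumed from the table above; confirm with sinfo)
    sbatch --gres=gpu:1 --constraint=turing job.sh

    # Request one GPU of a specific model by GRES type, e.g. a v100
    sbatch --gres=gpu:v100:1 job.sh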

Special Nodes

Power9

Power9 servers use a different processor instruction set than Intel and AMD (x86_64). As such, you need to set up your environment again specifically for those nodes.

  • Power9 machines have 128 threads (2 processors / 16 cores / 4-way SMT).

  • 4 x V100 SXM2 (16 GB) with NVLink

  • On a Power9 machine, GPUs and CPUs communicate with each other over NVLink instead of PCIe, which allows fast transfers between them. More on LMS (Large Model Support).

Power9 nodes run the same software stack as the regular nodes and provide the same software, so you deploy your environment on them just as you would on a regular node.
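Because the instruction set differs, an environment built on an x86_64 node cannot be reused and has to be recreated on the Power9 node itself. A minimal sketch, assuming interactive access through Slurm and that these nodes are selected with a power9 constraint (verify the advertised feature name with sinfo -o "%N %f" first):

    # Request an interactive shell with one GPU on a Power9 node
    # (the "power9" constraint name is an assumption; check sinfo for the exact feature)
    srun --gres=gpu:1 -c 8 --constraint=power9 --pty bash

    # x86_64 binaries and conda environments will not run here,
    # so rebuild the environment on the node itself
    conda create -n ppc64le python=3.6
    conda activate ppc64le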

AMD

Warning

As of August 20, the GPUs had to be returned to AMD. Mila will receive more samples. You can join the AMD Slack channels to get the latest information.

Mila has a few nodes equipped with AMD MI50 GPUs. To use them, request the AMD reservation and set up the ROCm software stack:

    # Request a node from the AMD reservation with one GPU and 8 CPU cores
    srun --gres=gpu -c 8 --reservation=AMD --pty bash

    # First-time setup of the AMD (ROCm) stack in a fresh conda environment
    conda create -n rocm python=3.6
    conda activate rocm

    # ROCm builds of TensorFlow and PyTorch (the PyTorch wheel is provided on the cluster)
    pip install tensorflow-rocm
    pip install /wheels/pytorch/torch-1.1.0a0+d8b9d32-cp36-cp36m-linux_x86_64.whl
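Once the environment is built, a quick sanity check can confirm that the MI50 is visible. This is a sketch assuming rocm-smi is available on the node and the ROCm PyTorch wheel above installed cleanly (ROCm builds of PyTorch report devices through the usual torch.cuda API):

    # List the AMD GPUs seen by the ROCm driver
    rocm-smi

    # ROCm PyTorch exposes HIP devices through the torch.cuda namespace
    python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"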