1 Frequently asked questions (FAQs)¶
- 1 Frequently asked questions (FAQs)
- 1.1 Connection/SSH
- 1.2 Shell
- 1.3 Slurm
- 1.3.1 How can I get an interactive shell on the cluster ?
- 1.3.2 Why am I getting a srun: error: --mem and --mem-per-cpu are mutually exclusive error ?
- 1.3.3 How can I see where and if my jobs are running ?
- 1.3.4 Unable to allocate resources: Invalid account or account/partition combination specified
- 1.3.5 How do I cancel a job?
- 1.3.6 How can I access a node on which one of my jobs is running ?
- 1.3.7 I’m getting Permission denied (publickey) while trying to connect to a node ?
- 1.3.8 Where do I put my data during a job ?
- 1.3.9 I am getting the following error: slurmstepd: error: Detected 1 oom-kill event(s) in step #####.batch cgroup.
- 1.3.10 I am getting the following error: fork: retry: Resource temporarily unavailable
1.2.1 How do I change my shell ?¶
By default you will be assigned /bin/bash as your shell. If you would like to change it
to another one, please submit a support ticket.
1.3.1 How can I get an interactive shell on the cluster ?¶
Run salloc [--slurm_options] without any executable at the end of the command; this will launch your
default shell in an interactive session. Remember that an interactive session is bound to the login node where
you started it, so you risk losing your job if the login node becomes unreachable.
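As a sketch, an interactive request might look like the following; the resource values (2 CPUs, 8 GB of RAM, one hour) are illustrative examples, not required defaults:

```shell
# Request an interactive session with explicit resources;
# adjust the values below to your actual needs.
salloc --cpus-per-task=2 --mem=8G --time=1:00:00
```

When the allocation is granted, your default shell starts directly on the allocated node.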
1.3.2 Why am I getting a srun: error: --mem and --mem-per-cpu are mutually exclusive error ?¶
You can safely ignore this; salloc has a default memory flag in case you don’t provide one.
1.3.3 How can I see where and if my jobs are running ?¶
Use squeue -u YOUR_USERNAME to see the status and location of all your jobs.
To get more info on a running job, try
scontrol show job #JOBID
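For a more compact overview, squeue’s -o option lets you pick the columns to display; the format specifiers below (%i job id, %P partition, %T state, %R reason or node list) are standard squeue format codes:

```shell
# Compact view of your jobs: id, partition, state, and node or pending reason.
squeue -u YOUR_USERNAME -o "%.18i %.9P %.8T %R"
```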
1.3.4 Unable to allocate resources: Invalid account or account/partition combination specified¶
Chances are your account is not set up properly. You should contact support to have it fixed.
1.3.5 How do I cancel a job?¶
Use the scancel #JOBID command with the id of the job you want cancelled. In case you want
to cancel all your jobs, type scancel -u YOUR_USERNAME. You can also cancel all your pending jobs with
scancel -t PD.
1.3.6 How can I access a node on which one of my jobs is running ?¶
You can ssh into a node on which you have a job running; your ssh connection will be adopted by your job, i.e.
when your job finishes, your ssh connection will be automatically terminated. In order to connect to a node, you need
password-less ssh, either with a key present in your home directory or with an
ssh-agent. You can generate a key on the
login node for password-less login like this:
ssh-keygen (3x ENTER)
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
chmod 700 ~/.ssh
1.3.7 I’m getting Permission denied (publickey) while trying to connect to a node ?¶
See the previous question.
1.3.8 Where do I put my data during a job ?¶
Since /home as well as the datasets are on shared file systems, it is recommended to copy them to
$SLURM_TMPDIR to better process them and leverage the higher-speed local drives. If you run a low-priority job subject to preemption, it’s better
to keep any output you want to preserve on the shared file systems, because
$SLURM_TMPDIR is deleted at the end of each job.
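A job script following this pattern might look like the sketch below; the directory names my_dataset and results are placeholders, not real paths, and the #SBATCH values are examples only:

```shell
#!/bin/bash
#SBATCH --time=1:00:00
# Copy the dataset from the shared file system to the fast node-local scratch.
# "my_dataset" is a placeholder for your actual dataset directory.
cp -r ~/my_dataset "$SLURM_TMPDIR/"
cd "$SLURM_TMPDIR"
# ... run your computation here, writing outputs under ./results ...
# Copy anything you want to keep back to the shared file system before the
# job ends, because $SLURM_TMPDIR is deleted when the job finishes.
cp -r results ~/
```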
1.3.9 I am getting the following error: slurmstepd: error: Detected 1 oom-kill event(s) in step #####.batch cgroup.¶
You exceeded the amount of memory allocated to your job: either you did not request enough memory, or you have a
memory leak in your process. Try increasing the amount of memory requested with