Monitoring GPU Metrics
There is no GPU on the login node, so the commands below will not work there.
There are two methods to monitor the GPU utilisation of your job.
Method 1 (Recommended)
Add the srun whichgpu command to your batch script before running your main workload, as shown in the example below.
# This command assumes that you have already created the environment
# We're using an absolute path here. You may use a relative path, as long as srun is executed in the same working directory
source ~/myenv/bin/activate
# Find out which GPU you are using
srun whichgpu
# If you require any packages, install them as usual before the srun job submission.
# pip3 install numpy
# Submit your job to the cluster
srun --gres=gpu:1 python /path/to/your/python/script.py
When the job is submitted to the cluster, an output file will be created in your current working directory. It contains information about your assigned GPU. Example:
[IS000G3@origami ~]$ cat IS000G3-20904.out
You are allocated NVIDIA GeForce RTX 2080 Ti on mustang
You are using GPU 0
... output truncated
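If you want to pull the node name and GPU number out of the output file without reading it by eye, a small shell function can do so. This is a minimal sketch that assumes the file contains the two lines in exactly the format shown above; parse_gpu_allocation is a hypothetical helper name, not a command provided by the cluster.

```shell
# Hypothetical helper (not part of the cluster tooling): extract the compute
# node and GPU index from a job output file.
# Assumes the file contains lines in the format shown in the example above:
#   "You are allocated <GPU model> on <node>"
#   "You are using GPU <index>"
parse_gpu_allocation() {
    awk '
        /^You are allocated / { node = $NF }   # last word is the node name
        /^You are using GPU / { gpu = $NF }    # last word is the GPU index
        END { print "node=" node " gpu=" gpu }
    ' "$1"
}

# Usage: parse_gpu_allocation IS000G3-20904.out
```

For the example file above, this prints node=mustang gpu=0, which are the two values to select on the Grafana dashboard.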
The above message indicates that your job is running on the compute node mustang and has been provisioned with GPU 0. With this information, log in to the Grafana dashboard while connected to the SMU network or the SMU VPN.
On the top left, select the appropriate compute node followed by the GPU number. In our example, we have selected mustang and GPU 0. On the top right, adjust the time range to view the relevant usage statistics for your GPU.
Method 2 (SSH into compute node, not recommended)
When multiple jobs are running on the same node, you will be ssh-ed into one of them at random. You cannot select which job you are ssh-ed into.
Users can monitor GPU usage by following the steps below:
- Execute myqueue
- Find the node under NODES NODELIST(REASON) in the last column, in this case aloha
- ssh into aloha
- Execute nvidia-smi
[IS000G3@origami ~]$ myqueue
JOBID PARTITION NAME USER STATE TIME TIME_LIMIT NODES NODELIST(REASON)
6435 student QAsubmission IS000G3 RUNNING 2:26:10 6:00:00 1 aloha
[IS000G3@origami ~]$ ssh aloha
The authenticity of host 'aloha (10.0.104.54)' can't be established.
ED25519 key fingerprint is SHA256:1GHYCZp4WkNy0oMV2vtil68fw8OUxdXAFz5uS7mUjbo.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added 'aloha,10.0.104.54' (ED25519) to the list of known hosts.
Last login: Tue Feb 7 17:22:50 2023 from 10.0.104.102
[IS000G3@aloha ~]$ nvidia-smi
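Plain nvidia-smi prints a single snapshot. To follow utilisation over time once you are on the compute node, its standard query options can print a compact CSV line at a fixed interval. The sketch below wraps that in a hypothetical function, watch_gpu; the wrapper name and the default interval of 5 seconds are assumptions made for illustration.

```shell
# Hypothetical helper (not part of the cluster tooling): print GPU index, model,
# utilisation and memory usage as CSV, repeating every N seconds (default 5).
# Stop it with Ctrl+C.
watch_gpu() {
    interval="${1:-5}"
    # nvidia-smi only exists on nodes with a GPU, so this fails on the login node.
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found; run this on a compute node" >&2
        return 1
    fi
    nvidia-smi \
        --query-gpu=index,name,utilization.gpu,memory.used,memory.total \
        --format=csv \
        -l "$interval"
}

# Usage: watch_gpu 10
```

On a shared node the output lists every GPU, so match the index column against the GPU number assigned to your job.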