Frequently Anticipated Questions
Why is the cluster unresponsive?
When the cluster is handling a large number of I/O operations, it responds more slowly.
If you are in the middle of an ls or copy operation, it may take longer than usual to complete.
Why can't I log in?
Ensure that your system meets the following requirements:
- ClearPass is installed and reports a Healthy status.
- You are connected to the WLAN-SMU WiFi network.
- No external VPN services are enabled.
Why can't I log in from outside SMU?
To access the GPU cluster while outside the SMU network, connect to the SMU VPN (Cisco).
If you do not have a school VPN account, approach your instructor for more information.
I am unable to change my password on first login
When you log in to the cluster for the first time, you are prompted to change your password. When asked for the "old" password, enter the password provided by your instructor.
I am unable to detect/find/locate/load GPUs
A GPU is assigned only when a script is run through the job scheduler. Refer to the Job submission guide for more information.
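One quick way to confirm from inside a script whether a GPU was assigned is to inspect the environment. The sketch below assumes the scheduler is Slurm (suggested by the squeue command used in this FAQ) and that it exports CUDA_VISIBLE_DEVICES for jobs that were allocated GPUs. The function name visible_gpus is illustrative and not part of any cluster tooling.

```python
import os


def visible_gpus(env=None):
    """Return the GPU indices exposed to this process, or an empty list."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES", "")
    # Assumption: an unset or empty variable means no GPU was assigned.
    return [item.strip() for item in raw.split(",") if item.strip()]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus:
        print("GPUs assigned to this job:", ", ".join(gpus))
    else:
        print("No GPU visible; was the script submitted through the job scheduler?")
```

If the list is empty when the script runs inside a submitted job, the job request probably did not ask for a GPU.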
Help!! My job is not running
A job may not be running for several reasons. Run squeue --me to check the state of your job.
If the state is PD, the job is pending, typically because all the resources are currently in use and your job is waiting in the queue.
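For anyone who wants to check job states from a script, the sketch below parses squeue output into readable rows. It assumes a Slurm version that supports --me, and it uses the format specifiers %i (job ID), %t (compact state) and %r (reason). The helper names and the wording of the state descriptions are illustrative, and the sample text in the demo is made up rather than real cluster output.

```python
import subprocess

# A few common Slurm compact state codes; the descriptions are paraphrased.
STATE_MEANINGS = {
    "PD": "pending (waiting in the queue)",
    "R": "running",
    "CG": "completing",
    "CD": "completed",
    "F": "failed",
    "CA": "cancelled",
    "TO": "timed out",
}


def describe_jobs(squeue_text):
    """Turn lines of '<jobid> <state> <reason>' into descriptive tuples."""
    rows = []
    for line in squeue_text.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        job_id, state = parts[0], parts[1]
        reason = parts[2].strip() if len(parts) > 2 else ""
        meaning = STATE_MEANINGS.get(state, "see the squeue documentation")
        rows.append((job_id, state, meaning, reason))
    return rows


def query_my_jobs():
    """Run squeue for the current user; only works where Slurm is installed."""
    result = subprocess.run(
        ["squeue", "--me", "--noheader", "--format=%i %t %r"],
        capture_output=True, text=True, check=True,
    )
    return describe_jobs(result.stdout)


if __name__ == "__main__":
    sample = "1234 PD Resources\n1235 R None\n"  # made-up sample, not real output
    for row in describe_jobs(sample):
        print(row)
```

On the cluster, query_my_jobs() would replace the sample text; the reason column often explains why a job is still pending.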
Why does my job fail?
A job can fail for several reasons. Check the following:
- Is it the only job failing on the cluster?
- Are all file paths referenced in the Python script available on the cluster?
- Have you installed the right Python libraries?
- Did you make the template file executable?
- Have the right modules been loaded for the libraries (e.g. TensorFlow/PyTorch)?
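Several of these checks can be automated before submitting a job. The following is a minimal sketch using only the Python standard library; the function name preflight and the example file and module names in the demo are hypothetical placeholders to replace with your own.

```python
import importlib.util
import os


def preflight(paths=(), modules=(), executables=()):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    for path in paths:
        if not os.path.exists(path):
            problems.append(f"missing path: {path}")
    for name in modules:
        # Use top-level package names here, e.g. "torch" rather than "torch.nn".
        if importlib.util.find_spec(name) is None:
            problems.append(f"module not installed: {name}")
    for script in executables:
        if not os.access(script, os.X_OK):
            problems.append(f"not executable: {script}")
    return problems


if __name__ == "__main__":
    # Hypothetical names for illustration only.
    issues = preflight(
        paths=["data/train.csv"],
        modules=["torch"],
        executables=["job.sh"],
    )
    print("\n".join(issues) if issues else "All pre-flight checks passed.")
```

Run it in the same environment the job will use, since installed libraries can differ between environments.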
Which CUDA (toolkit) version do I use?
The crimson cluster uses NVIDIA RTX 3090 GPUs, so use CUDA 11.1 or higher.
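To check a reported CUDA version against this minimum, a small comparison helper is enough. This is a sketch; the function name cuda_version_ok is illustrative, and how you obtain the version string depends on your framework.

```python
def cuda_version_ok(version, minimum=(11, 1)):
    """Return True if a CUDA version string such as "11.8" meets the minimum."""
    if not version:
        return False  # e.g. None from a CPU-only framework build
    parts = tuple(int(p) for p in version.split(".")[:2])
    if len(parts) == 1:
        parts = (parts[0], 0)  # treat "11" as 11.0
    return parts >= minimum


if __name__ == "__main__":
    for candidate in ("10.2", "11.1", "12.2"):
        print(candidate, "OK" if cuda_version_ok(candidate) else "too old")
```

With PyTorch installed, torch.version.cuda holds the CUDA version the build was compiled against (None for CPU-only builds) and can be passed to this function directly.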
Unable to find my question here
Please post your question on the GitHub forum in the following format:
Subject: [Accountname] <Your issue>
Description of your issue
- What happened?
- What should happen instead?
- Steps to reproduce
- Screenshots if any
- Upload the .out file if it's available