This guide assumes that you have access to the AI Hub machine learning infrastructure. In this guide we will use ML3 as an example, but everything in this guide should work on every machine in the AI Hub.
The JupyterLab environment is meant for development and easy visualization. For longer-running tasks we recommend running scripts in batch fashion, which makes it easier to share GPU resources.
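As a minimal sketch of what running "in batch fashion" can look like on a shared node (this is an assumption about local practice, not an official site policy; train.py is a hypothetical placeholder for your own script):
nohup python train.py > train.log 2>&1 &    # keeps running after you log out; output goes to train.log
You can also run the script inside tmux or screen so that you can reattach to it later.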
How to start a Jupyter Notebook session on ML3
- Windows users
There have been reported issues with port forwarding from the Windows command line, so we recommend using Git for Windows (https://gitforwindows.org/) instead.
1. Log in to ml3.hpc.uio.no
ssh -J <USER_NAME>@gothmog.uio.no <USER_NAME>@ml3.hpc.uio.no
- Note that any ML node could be used for this; to make sure you understand the guide, try following it on another ML node as well.
- If you log in often, an SSH configuration entry can shorten this command, as sketched below.
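As an optional convenience (a sketch only; the host alias ml3 is hypothetical, and ProxyJump requires OpenSSH 7.3 or newer), you can add an entry like the following to ~/.ssh/config so that a plain ssh ml3 performs the jump automatically:
Host ml3
    HostName ml3.hpc.uio.no
    User <USER_NAME>
    ProxyJump <USER_NAME>@gothmog.uio.no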
2. Load the module
module purge
module load JupyterLab/3.2.8-GCCcore-10.3.0
# Optional: if you want to use TensorFlow
module load TensorFlow/2.6.0-foss-2021a-CUDA-11.3.1
# Optional: if you want to use PyTorch
module load PyTorch/1.11.0-foss-2021a-CUDA-11.3.1
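Module versions change over time; if the versions above are no longer available, you can list what is currently installed (module avail is a standard Environment Modules/Lmod subcommand):
module avail JupyterLab    # list installed JupyterLab modules; works the same for TensorFlow and PyTorch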
3. Start Jupyter
jupyter-lab --no-browser
- This will print something similar to the output below.
- Note the URL with the token; this is the key to opening the notebook.
[W 08:36:08.506 LabApp] JupyterLab server extension not enabled, manually loading...
[I 08:36:08.516 LabApp] JupyterLab extension loaded from /storage/software/JupyterLab/2.2.8-GCCcore-8.3.0/lib/python3.7/site-packages/jupyterlab
[I 08:36:08.517 LabApp] JupyterLab application directory is /storage/software/JupyterLab/2.2.8-GCCcore-8.3.0/share/jupyter/lab
[I 08:36:08.519 LabApp] Serving notebooks from local directory: /itf-fi-ml/home/jorgehn
[I 08:36:08.519 LabApp] The Jupyter Notebook is running at:
[I 08:36:08.519 LabApp] http://localhost:8888/?token=79197a78ada474fb34680a8beaebb55207b01b92b9ed4c02
[I 08:36:08.519 LabApp] or http://127.0.0.1:8888/?token=79197a78ada474fb34680a8beaebb55207b01b92b9ed4c02
[I 08:36:08.519 LabApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 08:36:08.583 LabApp]
To access the notebook, open this file in a browser:
file:///itf-fi-ml/home/jorgehn/.local/share/jupyter/runtime/nbserver-918461-open.html
Or copy and paste one of these URLs:
http://localhost:8888/?token=79197a78ada474fb34680a8beaebb55207b01b92b9ed4c02
or http://127.0.0.1:8888/?token=79197a78ada474fb34680a8beaebb55207b01b92b9ed4c02
- Do not close the terminal or press CTRL-C in the jupyter-lab instance until you are done using JupyterLab.
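By default Jupyter picks the first free port starting at 8888, so on a busy shared machine your port may differ from the examples here. If you prefer a fixed port, jupyter-lab accepts a --port flag (8890 below is an arbitrary example):
jupyter-lab --no-browser --port=8890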
4. Next we need to SSH into ML3 again, but this time forwarding the port used by Jupyter.
- Note the port used by Jupyter from the URL above. In this case the URL was
http://localhost:8888/?token=79197a78ada474fb34680a8beaebb55207b01b92b9ed4c02
so the port is 8888.
- Without closing the terminal where jupyter-lab is running, open a new terminal and do the following steps.
- Next we will SSH into ML3 with this port exposed to our local machine, using the following template:
ssh -L <port>:localhost:<port> -J <username>@gothmog.uio.no <username>@ml3.hpc.uio.no
Or, with the actual port we found above:
ssh -L 8888:localhost:8888 -J <username>@gothmog.uio.no <username>@ml3.hpc.uio.no
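Note that the two ports in -L <local>:localhost:<remote> do not have to match. If port 8888 is already taken on your own machine, you can forward a different local port (9000 below is an arbitrary example) and browse to http://localhost:9000 with the same token:
ssh -L 9000:localhost:8888 -J <username>@gothmog.uio.no <username>@ml3.hpc.uio.no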
5. Finally we are ready to open the URL printed by Jupyter. Simply open it in a browser on your local machine.
6. When you are finished, please remember to shut down Jupyter and log out from the machine. This will release the port for use by others.
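As an optional sanity check (ss ships with most modern Linux distributions), you can confirm on the ML node that nothing is left listening on your port:
ss -ltn | grep 8888    # no output means the port has been released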
Using a specific GPU
Since the AI Hub machines are shared resources with a first-come-first-served setup, it can sometimes be useful to tell Jupyter to use only one GPU, or a specific GPU that is less loaded.
To accomplish this we can set the environment variable CUDA_VISIBLE_DEVICES.
- First we will use nvidia-smi to show the available GPUs in the system and their current load. Simply run
nvidia-smi
after logging in; the output should look something like the following:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  On   | 00000000:18:00.0 Off |                  N/A |
| 34%   48C    P2    54W / 250W | 10096MiB / 11019MiB  |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  On   | 00000000:3B:00.0 Off |                  N/A |
| 32%   45C    P2    84W / 250W |  7422MiB / 11019MiB  |     17%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  On   | 00000000:86:00.0 Off |                  N/A |
| 28%   27C    P8    11W / 250W |     3MiB / 11019MiB  |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208...  On   | 00000000:AF:00.0 Off |                  N/A |
| 28%   29C    P8    34W / 250W |     3MiB / 11019MiB  |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
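For a more compact view, you can also query just the fields you care about (--query-gpu and --format are standard nvidia-smi options):
nvidia-smi --query-gpu=index,name,memory.used,memory.total,utilization.gpu --format=csv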
- From the output above we can see that there are four GPUs connected to the machine. Luckily for us, none of them are heavily loaded; however, to show the concept we will restrict ourselves to using only GPU 3. (GPU 0 is usually the default, so picking something else is a good way to avoid competing with other users for the same resources.)
- Next we will load Jupyter and TensorFlow as described above, but when launching jupyter-lab we will add the following prefix to restrict our usage to only one GPU:
CUDA_VISIBLE_DEVICES=3 jupyter-lab --no-browser
- The result will be that only a single GPU is visible to TensorFlow in our Jupyter instance. Note that when we used 3 above, it means that only the GPU with ID 3 will show up in TensorFlow. We could also have used CUDA_VISIBLE_DEVICES=1,3 to expose both GPU 1 and GPU 3. The default is to use all devices, and if only one GPU is needed, GPU 0 is usually the one used.
- You can check the expected behavior in TensorFlow as sketched below. Note that only one GPU is shown and it is given ID 0 even though we requested ID 3; that is simply the visible devices being renumbered from 0.
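A quick way to perform this check (tf.config.list_physical_devices is the standard TensorFlow 2.x API) is to run the following in a notebook cell or directly from the shell:
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
# Expected output with CUDA_VISIBLE_DEVICES=3 (one device, renumbered to 0):
# [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]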
Debugging issues
- You get the following error when trying to log in:
ssh -L 8880:localhost:8880 -J <USER_NAME>@gothmog.uio.no <USER_NAME>@ml3.hpc.uio.no
bind [127.0.0.1]:8880: Address already in use
channel_setup_fwd_listener_tcpip: cannot listen to port: 8880
- Reason: you already have a port forwarding set up.
- Solution: log out from all other sessions.
- Solution 2 (if you have lost the terminal, e.g. you forgot to log off before the summer holidays and are trying to reconnect when you return):
- See if there are already running sessions and kill them:
ps aux | grep <USER_NAME>    # list your processes; look for old jupyter or sshd sessions
kill <PID>    # replace <PID> with the process ID from the output
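If the conflict is instead on your local machine (for example a stale tunnel), you can look up which local process holds the port (assuming lsof is installed, as on most Linux and macOS systems) and terminate it the same way:
lsof -i :8880    # show the process listening on local port 8880
kill <PID>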