TSD Operational Log
We are currently experiencing issues with data import/export.
We are working to resolve the issue.
We are currently having some trouble with the dataloader, so files in nettskjema-data are not being updated.
We are working to resolve the issue, and answers will be updated once resolved.
While the problems are ongoing, you can access the nettskjema answers using the internal nettskjema portal.
Another patch to TSD storage, which should resolve our sporadic issues. The upgrade will be performed node by node, which may temporarily interrupt virtual machines' access to data.
Starts at 07:00 on Wednesday the 13th, and will take about 10 minutes.
Update: The patching finished at 07:13; operations were back to normal around 07:15.
This affects NFS on TSD clients as well as central services. SMB (Windows) is not affected. We're restarting services to restore connections, and are actively working with the 3rd party vendor to resolve the underlying issues we have seen lately.
Read and write access was confirmed to be working from clients at 08:22.
Storage still has problems, and there were a few NFS hangs during the afternoon:
14:45 We were notified of NFS hangs, and noticed that one of the two protocol nodes had a downed NFS service. Commenced a restart.
14:59 Node is back up again, mounts work.
15:36 Discovered that the other protocol node has problems with writes on its NFS exports. Proceeded to restart this node as well.
15:46 Node 2 back up, production back to normal.
15:57 NFS went down again on node 1. Restarting services.
16:07 Node back up, production back.
16:15 Again discovered NFS hangs on write on the other protocol node, and restarted it.
16:20 Both nodes are operational, and neither hangs on write anymore.
After storage problems in the night, we discovered this morning that all clients still lacked write access. This affects NFS on TSD clients as well as central services. SMB (Windows) is not affected.
Measures are currently being taken to restore the connections, and the 3rd party vendor has been notified so that the instability can be resolved ASAP.
07:59 - 1/2 connections back up.
08:06 - Both connections back up; storage is fully operational.
For now, at least. We'll need to follow up further with the vendor.
The general storage issues have been resolved; this procedure simply applies the finishing touches to the network interfaces. This is expected to improve NFS performance.
The tuning will be applied node by node from 22:00, and takes about 10 minutes. This may cause temporary issues with virtual machines' access to data. We'll be following up on any such issues.
Maintenance completed at 22:15.
Regarding the ongoing storage issues, an upgrade to the NFS implementation will be installed on central storage tomorrow morning from 07:00.
The maintenance requires full takedown for the duration of the upgrade, approximately 10 minutes.
TSD will therefore have complete downtime for both Linux and Windows clients.
We apologize for the exceptionally short notice; unfortunately, this is a necessary measure to get us out of a critical situation.
07:30: Due to initial issues with installation, the downtime is extended to 07:40. We apologize for the inconvenience.
07:38: Storage is upgraded, and production is resumed.
This affects NFS on TSD clients as well as central services. SMB (Windows) is not affected. We're actively working with the 3rd party vendor to resolve the issue.
Update 04.11.2024
File import and export functionality is restored.
TSD will apply a security patch from IBM to the storage system, which may cause temporary issues with virtual machines' access to data. We'll be following up on any such issues.
There are currently issues with storage in TSD, causing some users to experience problems logging in, blank screens, or sessions disconnecting abruptly. We are actively working to resolve the issue.
Update 23-10-2024: Most issues have been resolved, but we're still looking at individual cases and will continue to work with the third party vendor.
We are currently experiencing technical difficulties with the storage services, and are working to resolve the issue.
Update 14:15: The issue has been resolved.
We are performing storage system maintenance on Wednesday 9 October from 16:00 CET to apply security updates recommended by IBM. During this time period there will be service interruptions to virtual machines.
A maintenance reservation for the downtime has been set on Colossus starting at 15:00. Jobs that cannot complete before the downtime will not be scheduled until after the downtime. Any running jobs and processes will be killed. If you had running jobs or processes at the start of the downtime, please check your job output for errors. To be on the safe side, save your work, cancel running jobs and log off prior to the downtime. (Note that this is one hour earlier than the storage maintenance - we will use the opportunity to upgrade Slurm on Colossus, instead of having a separate downtime for that.)
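For reference, a minimal sketch of how you might check and cancel your own jobs with standard Slurm commands before the downtime (the job ID below is only a placeholder):

    squeue -u $USER     # list your pending and running jobs
    scancel 1234567     # cancel a specific job by its job ID (placeholder ID)
    scancel -u $USER    # cancel all of your jobs at once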
During the downtime, follow this opslog for updates.
There were issues with project storage in TSD on Monday the 30th of September, between approximately 12:45 PM and 1:30 PM. TSD was partially inaccessible during this time period, and users who were logged in experienced the project storage being unavailable. The issue was quickly resolved and should not have any further consequences. If you experience any issues, please contact us.
GPU-3 in the UiO allocation of Colossus is down and will need replacement parts. This means there is currently only one GPU node available. Until it is restored, expect longer queue times in the UiO allocation when requesting GPUs.
Update 2024-11-04: The node is back in production.
TSD is unavailable for all users due to planned maintenance.
It will be unavailable for most of the day. This notice will be updated once maintenance is complete.
After a brief hiccup in the system yesterday, many projects are receiving the error "This desktop currently has no desktop sources available. Please try connecting to this desktop again later, or contact your system administrator." when trying to log in.
[RESOLVED]: A fix was issued just after 10:00 on September 3rd. If you still receive the same error, please send an email to tsd-drift@usit.uio.no, and we can reboot your machine manually.
The certificate for the public TSD consent portal has expired, causing https://consent-portal.tsd.usit.no/ to be unavailable for TSD users. We are actively working to resolve the issue.
The EL7 submit VMs have been powered off. Please use the new EL9 pXX-hpc-01 submit hosts instead for submitting jobs to Colossus.
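As a rough sketch (the project number and job script name below are placeholders, not real hosts or files), submitting via the new hosts looks something like:

    ssh p11-hpc-01      # log in to your project's EL9 submit host (replace p11 with your project number)
    sbatch my_job.sm    # submit your Slurm job script to Colossus as before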
Colossus is currently not able to contact the TSD license server, which means that it is not possible to use licensed software on Colossus.
We are working on solving the issue.
Whisper does not work because there is no Whisper module for Colossus yet. We're compiling a new version, which will require updates to the transcribe script. Updated instructions will follow.
Update 24-08-15: A new module "whisper/20231117-foss-2023a-CUDA-12.1.1" is now available. Updated "transcribe_data" and "whisper.sm" files are available in "/tsd/shared/software/whisper". Please copy them to your local whisper directory (and overwrite/remove the old files). Please note the following changes:
- The Whisper version, toolchain and CUDA version have been updated.
- Whisper now defaults to model "large-v3" (instead of "large-v2" in the previous module). To use the v2 model, see below.
- All OpenAI models are now included in the module and can be used via the environment variables set by the module. Please load the module and run "printenv|grep EBWHISPERMODEL" to see the available models. To use...
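For example, loading the new module and listing the model variables mentioned above can be done like this (the module name and command are the ones quoted in the note above):

    module load whisper/20231117-foss-2023a-CUDA-12.1.1   # load the new Whisper module
    printenv | grep EBWHISPERMODEL                         # list the environment variables for the available models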
There have been some issues with dataloader processing for some projects, due to an outage of a central service. This can cause some delays in processing of data stored in data/durable/nettskjema-data.
No data has been lost, and all answers are already available through the internal nettskjema portal. For most projects, the data in nettskjema-data should be updated within a few hours.
If nettskjema-data has still not been updated by tomorrow, send us an email at tsd-drift@usit.uio.no with your project number and form id.
There is currently a login issue with data.tsd.usit.no. We are working on resolving it.
The operating system (OS) on Colossus will be upgraded starting Monday August 5th. This is a major upgrade, and will take one week. During the upgrade, Colossus will be unavailable.
Files in home directories and project areas will be accessible during the upgrade.
The main reason for the upgrade is that the current OS is very old and will soon reach end of life.
During the downtime, we will reorganize the cluster, upgrade the networking and upgrade the OS from CentOS7 to Rocky9. Rocky, like CentOS, is a Red Hat clone.
The software stack (available via "module load") will be reinstalled. We will install toolchains 2021a and newer. This means that not all old versions of all software will be available after the upgrade. If you use software older than this, please start using newer versions, if possible. If something is missing, please submit the software reque...
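As an illustrative sketch only (the toolchain name below is an example of a 2021a-or-newer toolchain, not a statement of what will actually be installed), checking and loading software after the upgrade would look like:

    module avail              # list the reinstalled software stack
    module load foss/2023a    # load a toolchain from 2021a or newer (example name)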
The existing hardware and software for the host tsd-fx03 need to be replaced. The new host is ready to take over, and we plan to switch over to the new host on Monday 2024-04-29. The swap is expected to take less than 5 minutes and in the ideal case should not be detectable by its users.
The switch took longer than expected, due to issues with routing affecting its services. The new server was put in production today, 2024-05-13.