TSD Operational Log
The general storage issues have been resolved; this procedure simply applies finishing touches to the network interfaces. This is expected to improve NFS performance.
The tuning will be applied node by node from 22:00 and takes about 10 minutes. This may cause temporary issues with virtual machines' access to data. We will follow up on any such issues.
Regarding the ongoing storage issues: an upgrade to the NFS implementation will be installed on central storage tomorrow morning, starting at 07:00.
The maintenance requires a full shutdown for the duration of the upgrade, approximately 10 minutes.
TSD will therefore have complete downtime for both Linux and Windows clients.
We apologize for the exceptionally short notice; unfortunately, this measure is required to get us out of a critical situation.
07:30: Due to initial issues with installation, the downtime is extended to 07:40. We apologize for the inconvenience.
07:38: The storage upgrade is complete, and production has resumed.
This affects NFS on TSD clients as well as central services. SMB (Windows) is not affected. We are actively working with the third-party vendor to resolve the issue.
Update 04.11.2024
File import and export functionality is restored.
TSD will apply a security patch from IBM to the storage system, which may cause temporary issues with virtual machines' access to data. We will follow up on any such issues.
There are currently issues with storage in TSD, causing some users to experience problems logging in, blank screens, or sessions disconnecting abruptly. We are actively working to resolve the issue.
Update 23-10-2024: Most issues have been resolved, but we are still looking at individual cases and will continue working with the third-party vendor.
We are currently experiencing technical difficulties with the storage services, and are working to resolve the issue.
Update 14:15: The issue has been resolved.
We are performing storage system maintenance on Wednesday 9 October from 16:00 CET to apply security updates recommended by IBM. During this time period there will be service interruptions to virtual machines.
A maintenance reservation for the downtime has been set on Colossus starting at 15:00. Jobs that cannot complete before the downtime will not be scheduled until after the downtime. Any running jobs and processes will be killed. If you had running jobs or processes at the start of the downtime, please check your job output for errors. To be on the safe side, save your work, cancel running jobs and log off prior to the downtime. (Note that this is one hour earlier than the storage maintenance - we will use the opportunity to upgrade Slurm on Colossus, instead of having a separate downtime for that.)
During the downtime, follow this opslog for updates.
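The pre-downtime advice above can be sketched with standard Slurm commands (the job ID below is hypothetical; replace it with an ID from your own queue):

```shell
# List your own queued and running jobs on Colossus
squeue -u "$USER"

# Cancel a specific job that cannot finish before the downtime
# (123456 is a hypothetical job ID)
scancel 123456

# Or cancel all of your jobs in one go
scancel -u "$USER"
```

After the downtime, pending jobs that were not cancelled will be scheduled as usual.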
There were issues with project storage in TSD on Monday the 30th of September, between approximately 12:45 PM and 1:30 PM. TSD was partially inaccessible during this period, and users who were logged in experienced the project storage being unavailable. The issue was quickly resolved and should not have any further consequences. If you experience any issues, please contact us.
GPU-3 in the UiO allocation of Colossus is down and needs replacement parts. This means only one GPU node is currently available. Until it's restored, expect longer queue times in the UiO allocation when requesting GPUs.
Update 2024-11-04: The node is back in production.
TSD is unavailable for all users due to planned maintenance.
It will be unavailable for most of the day. This notice will be updated once maintenance is complete.
After a brief hiccup in the system yesterday, many projects are receiving the error "This desktop currently has no desktop sources available. Please try connecting to this desktop again later, or contact your system administrator." when trying to log in.
[RESOLVED]: A fix was issued just after 10:00 on September 3rd. If you still receive the same error, please send an email to tsd-drift@usit.uio.no, and we can reboot your machine manually.
The certificate for the public TSD consent portal has expired, causing https://consent-portal.tsd.usit.no/ to be unavailable for TSD users. We are actively working to resolve the issue.
The EL7 submit VMs have been powered off. Please use the new EL9 pXX-hpc-01 submit hosts instead for submitting jobs to Colossus.
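A minimal sketch of submitting via the new hosts (replace pXX with your project number; the job script name is hypothetical):

```shell
# From inside TSD, log in to your project's new EL9 submit host
# (e.g. p123-hpc-01 for project p123)
ssh pXX-hpc-01

# Submit a job script to Colossus as before
sbatch myjob.sbatch
```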
Colossus is currently not able to contact the TSD license server, which means that it is not possible to use licensed software on Colossus.
We are working on solving the issue.
Whisper does not work because there is no Whisper module for Colossus yet. We're compiling a new version, which will require updates to the transcribe script. Updated instructions will follow.
Update 24-08-15: A new module "whisper/20231117-foss-2023a-CUDA-12.1.1" is now available. Updated "transcribe_data" and "whisper.sm" files are available in "/tsd/shared/software/whisper". Please copy them to your local whisper directory (and overwrite/remove the old files). Please note the following changes:
- The Whisper version, toolchain and CUDA version have been updated.
- Whisper now defaults to model "large-v3" (instead of "large-v2" in the previous module). To use the v2 model, see below.
- All OpenAI models are now included in the module and can be selected via the environment variables set by the module. Please load the module and run "printenv|grep EBWHISPERMODEL" to see the available models. To use...
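The update steps above can be sketched as follows (the destination directory ~/whisper is a hypothetical location; use your project's actual whisper directory):

```shell
# Copy the updated helper files over the old ones
cp /tsd/shared/software/whisper/transcribe_data \
   /tsd/shared/software/whisper/whisper.sm \
   ~/whisper/

# Load the new module and list the bundled model variables
module load whisper/20231117-foss-2023a-CUDA-12.1.1
printenv | grep EBWHISPERMODEL
```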
There have been some issues with dataloader processing for some projects, due to an outage of a central service. This can cause some delays in processing of data stored in data/durable/nettskjema-data.
No data has been lost, and the data is already available through the internal nettskjema portal. For most projects, the data in nettskjema-data should be updated within a few hours.
If nettskjema-data has still not been updated by tomorrow, send us an email at tsd-drift@usit.uio.no with your project number and form id.
There is currently a login issue with data.tsd.usit.no. We are working on resolving it.
The operating system (OS) on Colossus will be upgraded starting Monday August 5th. This is a major upgrade, and will take one week. During the upgrade, Colossus will be unavailable.
Files in home directories and project areas will be accessible during the upgrade.
The main reason for the upgrade is that the current OS is very old and will soon reach end of life.
During the downtime, we will reorganize the cluster, upgrade the networking and upgrade the OS from CentOS7 to Rocky9. Rocky, like CentOS, is a RedHat clone.
The software stack (available via "module load") will be reinstalled. We will install toolchains 2021a and newer. This means that not all old versions of all software will be available after the upgrade. If you use software older than this, please start using newer versions, if possible. If something is missing, please submit the software reque...
The existing hardware and software for the host tsd-fx03 need to be replaced. The new host is ready to take over, and we plan to switch over to it on Monday 2024-04-29. The swap is expected to take less than 5 minutes and, in the ideal case, should not be noticeable to users.
The switchover took longer than expected due to routing issues affecting its services. The new server was put into production today, 2024-05-13.
We're experiencing some issues with our data portal, but we're on it and working to get things back to normal. Thanks for your patience while we fix the problem.
Several reservations (including tsd) on Colossus are currently unavailable. The Sigma2 allocation is not affected.
Dear TSD user,
We need to temporarily halt self-service for maintenance. We apologize for the inconvenience.
TSD Team
From 2024-03-20 17:30, the first 100 (of 250) submit hosts will be migrated to a new virtualization cluster overnight. This requires the VMs to be powered off, migrated, and powered on again. The downtime per host will be about 5-10 minutes.
Update 2024-03-21: The remaining submit hosts will be migrated from 2024-03-21 17:30.
Applicants who use TSD Self Service to apply for membership in a TSD project currently end up in a loop when they return from ID-porten, which prevents them from submitting their application.
We are working on the problem.
[2024-03-18 11:00: update] The maintenance is now over, and jobs are running again.
On Monday 18-03-2024 at 10:00, there will be a short maintenance stop to apply a critical configuration change.
A maintenance reservation has been set in Slurm. Any submitted jobs that cannot complete before the downtime will remain pending until after the downtime.