TSD Operational Log - Page 2
We're experiencing some issues with our Data Portal, but we're on it and working to get things back to normal. Thanks for your patience while we fix the problem.
Several reservations (including tsd) on Colossus are currently unavailable. The Sigma2 allocation is not affected.
Dear TSD user,
We need to temporarily halt self-service for maintenance. We apologize for the inconvenience.
TSD Team
From 2024-03-20 17:30, the first 100 (of 250) submit hosts will be migrated to a new virtualization cluster overnight. This requires the VMs to be powered off, migrated, and powered on again. The downtime per host will be about 5-10 minutes.
Update 2024-03-21: The remaining submit hosts will be migrated from 2024-03-21 17:30.
Applicants who use TSD Self Service to apply for membership in a TSD project currently end up in a loop when they return from ID-porten, which prevents them from submitting their application.
We are working on the problem.
[2024-03-18 11:00: update] The maintenance is now over, and jobs are running again.
On Monday 2024-03-18 at 10:00 there will be a short maintenance stop to apply a critical configuration change.
A maintenance reservation has been set in Slurm. Any submitted jobs that cannot complete before the downtime will remain pending until after the downtime.
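As a rough illustration of the rule above, a job can only start if its requested time limit lets it finish before the reservation begins. The sketch below is a simplified model of that rule, not how Slurm itself is implemented; the function name and the example times (other than the maintenance start) are hypothetical.

```python
from datetime import datetime, timedelta


def fits_before_maintenance(now, time_limit, maintenance_start):
    """Return True if a job started at `now` with the requested
    `time_limit` would end no later than `maintenance_start`."""
    return now + time_limit <= maintenance_start


# Maintenance start taken from the notice; the other values are made up.
maintenance_start = datetime(2024, 3, 18, 10, 0)
now = datetime(2024, 3, 18, 8, 0)

short_job = timedelta(hours=1, minutes=30)  # could finish by 09:30
long_job = timedelta(hours=4)               # would run past 10:00

print(fits_before_maintenance(now, short_job, maintenance_start))  # True
print(fits_before_maintenance(now, long_job, maintenance_start))   # False
```

In practice this means that requesting a shorter time limit, where the job allows it, may let the job run before the downtime instead of staying pending.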
The TSD IAM system will undergo a brief upgrade today between 15:00 and 15:15. The following services will be unavailable during this period:
1) Selfservice
2) Activation of new Nettskjema forms
3) Command line UR
We're currently facing storage issues causing input/output errors that are affecting many software applications, including Stata. Our team is actively working on resolving this matter.
[2024-02-21 17:49 update] All affected jobs have been requeued. 85 jobs had to be cancelled, so please inspect the output of your jobs to see if they're affected.
[2024-02-21 15:45 update] Several jobs that were running at the start of the upgrade did not successfully resume. We're trying to resolve the issue. New jobs are not affected.
[2024-02-21 10:35: update] The upgrade is now done, and seems to have gone well.
[2024-02-21 10:00: update] The upgrade has now started.
The queue system on Colossus will be upgraded on Wednesday (February 21) at 10:00. During the upgrade, running jobs will be suspended, and Slurm commands (squeue, sbatch, etc.) will not work. We expect the upgrade to take no more than 20 minutes.
TSD is performing network maintenance (on the DNS service) at 10:00 CET today.
We're experiencing technical difficulties with our project creation service. Our team is actively working on resolving the issue to restore full functionality as soon as possible. We apologize for any inconvenience this may cause and appreciate your patience during this time.
We are currently experiencing technical difficulties with core services, and are working to restore operations. Sorry for the inconvenience caused by this.
Slurm has been restarted on several compute nodes to resolve an issue. Please check the output of your jobs to see if they've been affected.
We're currently experiencing issues with some nodes on Colossus. Jobs on these nodes might have crashed and been requeued. Please check the output of your jobs to see if they've been affected.
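One possible way to check job output is to scan the output files for lines that hint at a cancellation or an I/O failure. The sketch below assumes the output files follow Slurm's default naming (slurm-<jobid>.out) and sit in one directory. The marker strings are only guesses at what an affected job might print, so adjust both to your own jobs.

```python
from pathlib import Path

# Assumed markers; replace with whatever your jobs print when they fail.
MARKERS = ("CANCELLED", "NODE_FAIL", "Input/output error")


def find_suspect_outputs(directory, pattern="slurm-*.out", markers=MARKERS):
    """Return {file name: [matching lines]} for every output file in
    `directory` that contains at least one of the marker strings."""
    hits = {}
    for path in sorted(Path(directory).glob(pattern)):
        text = path.read_text(errors="replace")
        matches = [
            line for line in text.splitlines()
            if any(marker in line for marker in markers)
        ]
        if matches:
            hits[path.name] = matches
    return hits


if __name__ == "__main__":
    # Scan the current directory and print each suspect file with its lines.
    for name, lines in find_suspect_outputs(".").items():
        print(name)
        for line in lines:
            print("   ", line)
```

A file that is not listed is not guaranteed to be fine; this is only a first pass before inspecting the results themselves.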
Some users are currently facing issues logging in to the Data Portal to export/import data. The specific error message they encounter is "An unexpected error has occurred which may affect the proper functioning of the application." If you also experience this error while attempting to log in to the Data Portal, please notify us by emailing tsd-drift@usit.uio.no.
TSD will be upgrading the storage system, which may cause some instability on the Windows and Linux VMs.
We've updated our password policy. The change takes effect on January 8th, 2024, and is part of our commitment to enhancing security protocols and safeguarding sensitive information.
All TSD users are now required to update their passwords at least once every year. This practice is essential to maintain a high level of security. You may change your password at any time by logging into TSD's Selfservice Portal: https://selfservice.tsd.usit.no/profile/change-password
You will receive an email notification 30 days before your password expires, giving you sufficient time to update it.
Users with overdue password changes will be contacted. The first group will be contacted on December 11th, 2023, and must complete a mandatory password change by January 8th, 2024.
Accounts that have not complied with the password update requirement by the deadline will be temporarily suspended. Access will be restored upon u...
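To make the timeline concrete, the sketch below works out when a password would expire and when the reminder email would be due. It assumes that "once every year" means 365 days after the last change, which the policy text does not spell out. The function name and the example date are illustrative only.

```python
from datetime import date, timedelta

MAX_PASSWORD_AGE = timedelta(days=365)  # assumption: one year = 365 days
NOTICE_PERIOD = timedelta(days=30)      # from the policy: 30 days' notice


def password_dates(last_changed):
    """Return (expiry date, reminder date) for a password that was
    last changed on `last_changed`."""
    expires = last_changed + MAX_PASSWORD_AGE
    notify = expires - NOTICE_PERIOD
    return expires, notify


# Example: a password changed on the day the policy takes effect.
expires, notify = password_dates(date(2024, 1, 8))
print(expires)  # 2025-01-07 (2024 is a leap year)
print(notify)   # 2024-12-08
```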
ID-porten is experiencing login problems. Please follow the status at:
https://status.digdir.no/incidents/ctml93xm9lnh
This impacts both TSD and Nettskjema logins.
We are currently experiencing some issues with file import through the Data Portal and are looking into the cause of the problem.
This affected TSD systems that relied on NFS.
[Update 08:48]
The core problem is resolved and most systems are up again. We are still investigating the cause of the problems, and some systems may still be unstable.
[Update 11:00]
All systems should work as normal.
TSD will be upgrading software on the storage system on Thursday, 2023-11-30, from 08:00 CET. We expect storage instability on the Windows and Linux VMs throughout the day.
Around 10:00 the storage system will be shut down for an estimated 15 minutes, which means network storage will be inaccessible on all TSD hosts (Windows and Linux) as well as on our central services (file import/export, etc.). To be on the safe side, please close any programs and log off from your VM prior to the downtime.
A maintenance reservation has been set on Colossus from 08:00. This means any jobs that cannot complete before the downtime will remain pending until after the maintenance completes. They'll resume automatically.
Our automation should fix any file system hangs that may occur, and we will be on standby to fix any remaining issues that do not automatically recover.
Apologies for the short notice, we've been in dialogue with IBM to alleviate storage instabi...
TSD will be upgrading software on the storage system tomorrow, 2023-11-17 08:00 - 09:00 CET. Our automation should fix any file system hangs that may occur, and we will be on standby to fix any remaining issues that do not automatically recover. Apologies for the short notice. We've been in dialogue with IBM to alleviate storage instability and want to act on their latest recommendations as fast as possible.
IBM will be upgrading software on the storage system tomorrow, 2023-11-17 07:00 - 09:00 CET. This upgrade is being done on short notice to remove bugs that have caused instability. We are taking the opportunity to improve stability as soon as we can, and we apologize for any inconvenience. Our automation should fix any file system hangs that may occur, and we will be on standby to fix any remaining issues that do not automatically recover.
Some users are reporting login issues and problems setting one-time codes. We are working to debug and fix the issue.
TSD Internal Publication will undergo a short maintenance.