TSD Operational Log - Page 8
Today we are upgrading VMware Horizon, and as such it is not possible to log in to Windows VMs.
There was a problem with a service related to changing QR-codes, which caused users to be unable to change their QR code between 09:15 and 14:00.
Yesterday, between 14:00 and 21:00, many jobs failed to start due to a problem with the scratch file system. These jobs have now been requeued and should start as normal again.
We are still trying to figure out what the cause was. The indications so far are that the filesystem got full, either in terms of disk space or number of files. If that is the case, jobs using $SCRATCH may have been affected or even crashed, so please check your jobs.
Update, 2020-09-27: We have confirmed that it was one or more jobs that filled up $SCRATCH, in the sense that they created too many files. We are setting up monitoring to be able to find out which user's jobs are responsible should it happen again.
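For users who want to check whether their own jobs create an unusually large number of files, the following is a minimal sketch in Python. It is not the monitoring TSD is setting up. The helper names `fs_usage` and `count_files` are hypothetical, and the use of a SCRATCH environment variable as the directory to inspect is an assumption based on the $SCRATCH mentioned above.

```python
import os


def fs_usage(path):
    """Return (fraction of disk space used, fraction of inodes used)
    for the filesystem that holds `path`."""
    st = os.statvfs(path)
    # Some filesystems report 0 total inodes; treat that as "unknown" (0.0).
    space_used = 1 - st.f_bavail / st.f_blocks if st.f_blocks else 0.0
    inodes_used = 1 - st.f_favail / st.f_files if st.f_files else 0.0
    return space_used, inodes_used


def count_files(root):
    """Count regular directory entries (files) below `root`, recursively."""
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        total += len(filenames)
    return total


if __name__ == "__main__":
    # Assumption: the job's scratch directory is exported as SCRATCH;
    # fall back to the current directory otherwise.
    target = os.environ.get("SCRATCH", ".")
    space, inodes = fs_usage(target)
    print(f"{target}: {space:.1%} of space used, {inodes:.1%} of inodes used")
    print(f"{target}: {count_files(target)} files")
```

A filesystem can be "full" in the sense of file count (inodes) while still having free disk space, which is why the sketch reports both numbers.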
We are fixing issues with Windows login at view.tsd.usit.no
Many Windows hosts ended up in an inaccessible state after automated upgrades over the weekend. We are currently getting the hosts back up, and will make adjustments to prevent this issue from recurring.
Due to maintenance on the Colossus compute cluster, the queue system (Slurm) commands (sbatch, squeue, etc.) will be unavailable for a couple of minutes. This will happen a couple of times today. Running jobs on Colossus will not be affected. Nothing else on the VMs, such as access to project areas and software modules, will be affected.
We will have a short maintenance stop of selfservice between 13.00 and 14.00 today, 11/08/2020.
Best TSD Team
We are currently having issues related to changing user account passwords in TSD. We're working to resolve this as quickly as possible.
--
Best regards,
TSD
We're currently having some trouble with access to the Colossus storage. We're working on solving this as quickly as possible.
Unfortunately, this will cause login problems for some of the machines in projects which are connected to Colossus.
--
Best regards,
TSD
Update: Dragen has been updated to CentOS7 and licenses have been renewed from August 1, 2020 to July 31, 2022. Access to Dragen has been revoked for all projects except p22. Access to Dragen can, however, be requested by sending an email to TSD.
We're upgrading Dragen to CentOS7 and installing the new filesystem. We will update the log when it's back online.
The maintenance is starting at 14.30 and will last no more than 15 minutes.
There is an issue with password and QR code reset which we are fixing now.
The Colossus file system is being upgraded from 10 to 24 June. During this time, no jobs can be run.
We are debugging and fixing the issue.
Colossus storage is currently down. We are addressing the issue and working to get jobs running again as soon as possible.
TSD is having network maintenance from 07:00 to 09:00 CET, and there will be interruptions to services during this period.
Colossus is currently down due to a crash in the cluster file system. We are working to resolve the issue. Submit nodes are also down as a side effect of this ongoing issue.
The PyPI, CRAN, STATA, etc. mirrors are down. We are working on bringing them back online.
Some compute nodes are down at the moment, causing jobs to be re-queued. We are working to bring them back online.
There are currently problems with the file-export feature on data.tsd.usit.no.
Update:
The issue should now be resolved.
Currently, some users are not able to log in to TSD via ThinLinc. We are working on a fix.
There's a license upgrade issue.
Update: the new licenses have been installed.