TSD Operational Log - Page 10

Published Mar. 31, 2020 8:54 AM

Since Saturday evening, there has been some instability with the Colossus file system, which affects all submit hosts and running jobs on Colossus.

Update:

The file system was back up at 09:44. All systems should work as normal, but please let us know if you experience problems.

[COMPLETE] Colossus maintenance from March 23rd to March 27th

Published Mar. 21, 2020 11:39 PM

As a step towards putting the new storage into production, we will restrict access to Colossus from Monday, March 23rd to March 27th.

During this time, there will be no access to Colossus or the /cluster storage, including the project areas and software modules. These services will also be inaccessible from TSD's virtual machines.

The HNAS areas ("durable") will not be affected by this downtime.

We will update this page with the progress of our work during the maintenance window.

[SOLVED] Dragen node in Colossus down for maintenance

Published Mar. 20, 2020 9:37 AM

Bio-IT processor software upgraded to v3.5.7.

[SOLVED] Issues with SPSS license

Published Mar. 12, 2020 12:07 PM

TSD users are currently experiencing issues with the SPSS license. We are working on resolving this, but it might take a couple of working days.

[SOLVED] https://view.tsd.usit.no is down.

Published Mar. 3, 2020 1:03 PM

UPDATE: This was solved, 2020-03-03, 13:39

VMware Horizon is currently down. This means that access to Windows virtual machines through VMware Horizon and https://view.tsd.usit.no is currently unavailable.

We are working on solving this as soon as possible.
Our apologies for the inconvenience.

--
Best regards,
TSD

[SOLVED] /cluster/ and modules unavailable to Linux VMs.

Published Mar. 2, 2020 3:10 PM

The machine exporting the /cluster file system crashed, causing hanging mounts on machines that mount /cluster.

We're working on solving the issue.

--
Best regards,
TSD

[SOLVED] Issue with ssh between project VMs

Published Feb. 3, 2020 9:58 AM

We are working on fixing an issue affecting ssh between project VMs. While the issue persists, you may experience trouble accessing your Colossus submit host.

[SOLVED] Login Problems in TSD

Published Jan. 31, 2020 1:44 PM

TSD users cannot log in. We are investigating the cause and working on a fix.

[SOLVED] Issue with Colossus export

Published Jan. 20, 2020 8:39 AM

We are having trouble with the Colossus NFS export, and are working to solve it.

[SOLVED] Colossus file system outage, and quick maintenance.

Published Jan. 15, 2020 9:20 AM

UPDATE: Maintenance is done, and all exports of /cluster should be back to normal as of 12:58, 15-01-2020.

Because the file system has crashed twice so far this week, we will take it down again today at 12:00, 15-01-2020, for quick maintenance.

As a result, /cluster will become unavailable on submit hosts and other project VMs that mount /cluster. However, HPC jobs running on Colossus itself will not be affected.

[SOLVED] Short outage on export of /cluster to project machines.

Published Jan. 13, 2020 3:31 PM

The machine responsible for making /cluster on Colossus available to the project machines in TSD crashed at 15:15 13-01-2020.

The services are now back up and running as expected.
For most projects this should not affect regular operations, but it could create problems for projects that frequently access /cluster from their virtual machines.

We are currently checking all projects and working on getting everything back in order for the projects still affected by the outage.

--
Best regards,
TSD

SPSS is displaying a warning for license expiry

Published Jan. 6, 2020 11:30 AM

SPSS is displaying a warning for license expiry. Please ignore this message. The problem will be solved soon.

[SOLVED] QR code generation on self service portal not working

Published Jan. 3, 2020 1:27 PM

We are working to solve the issue.

[SOLVED] Self service is going down for a short maintenance

Published Dec. 12, 2019 12:24 PM

[SOLVED] Short network outage

Published Dec. 9, 2019 3:19 PM

We had a short network outage due to firewall updates. The change has been reverted, and everything should be operational again.

[SOLVED] Issues with changing password in selfservice portal

Published Nov. 27, 2019 12:33 PM

We are experiencing technical issues with our selfservice portal, and as a result users currently cannot change their passwords. We are investigating the cause and working on a fix as soon as possible.

[SOLVED] Self service is going down for a short maintenance

Published Nov. 12, 2019 1:48 PM

[SOLVED] Short stop of Colossus file system

Published Nov. 7, 2019 12:11 PM

The /cluster file system on Colossus crashed around 11:00 today. It was restarted at 11:30. We are investigating the reason for the crash.

This probably affected running jobs on Colossus, so you should check your jobs.

It also affected the NFS-mounted /cluster file systems on the Linux VMs that mount /cluster. The mounts should be fine now, but please report any hanging mounts.
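One way to spot a hanging mount before reporting it is to probe the path with a time limit. The following is a minimal sketch, not a TSD-provided tool: the function name `check_mount`, the 5-second timeout, and the use of the `stat` command are all assumptions chosen for illustration.

```python
import subprocess


def check_mount(path: str, timeout_s: float = 5.0) -> str:
    """Probe `path` with `stat` and classify the result.

    Returns "ok" if stat succeeds in time, "error" if stat fails
    (for example, the path does not exist), and "hanging" if stat
    does not finish within `timeout_s` seconds.
    """
    proc = subprocess.Popen(
        ["stat", "--", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Best effort only: a process blocked on a dead NFS mount may
        # not exit right away, so we do not wait for it after killing.
        proc.kill()
        return "hanging"
    return "ok" if proc.returncode == 0 else "error"


if __name__ == "__main__":
    # /cluster is the mount point named in this log entry.
    print("/cluster:", check_mount("/cluster"))
```

A result of "hanging" suggests the mount is not responding and is worth reporting. A result of "error" more likely means the path is simply not mounted on that machine.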

[COMPLETED] Windows maintenance

Published Oct. 29, 2019 9:13 AM

There are a few remaining projects which need help with printing and GPUs, but the rest of the work is completed.

~~We are busy with Windows maintenance, which will cause interruptions to login sessions throughout the day.~~

[SOLVED] tsd-fx03 currently unavailable

Published Oct. 10, 2019 12:48 PM

We are experiencing some technical issues with tsd-fx03, and the service is currently unavailable as a result.

We are investigating the cause and working on fixing the issue as soon as possible.

[SOLVED] BankID mobile unavailable for Telenor customers

Published Oct. 8, 2019 2:55 PM

Difi has notified us that they are experiencing issues with BankID mobile for Telenor customers, which therefore may affect selfservice login for some TSD users.

[MAINTENANCE] Network maintenance: 2019-10-08 10:00

Published Oct. 8, 2019 9:37 AM

Dear TSD User

We will perform some internal network maintenance at 10:00. We do not expect any interruptions to services, but please let us know if you experience any issues.

UPDATE:

The problem should be resolved.

~~UPDATE:~~

~~We are experiencing problems with access to view.tsd.usit.no. We are working on resolving this.~~

[COMPLETED] Colossus maintenance on Thursday, 3rd October

Published Oct. 3, 2019 9:33 AM

UPDATES ON ONGOING MAINTENANCE:

09:30: Unmounting /cluster on all machines.

11:00: New machine is up, currently running tests.

12:15: Starting up services on project machines to allow access to /cluster again.

12:50: Colossus services and /cluster exports running as normal, now with 10 times the bandwidth.

[SOLVED] /cluster/ and modules unavailable to Linux VMs.

Published Oct. 2, 2019 4:10 PM

The NFS-exporter for Colossus crashed again, on the eve of our planned maintenance and switch to the new machine tomorrow morning.

We have restarted the services and will promptly restart the machines that are hanging as a result.

Our apologies for the inconvenience.

--
Best regards,
The TSD Team

[COMPLETED] Colossus maintenance on Thursday, 3rd October

Published Oct. 1, 2019 1:32 PM

UPDATE, 09:30: Starting to unmount the /cluster NFS shares on all machines.

We have now solved the problems we encountered on Monday, and are now ready to replace the NFS-exporter.

The work will start on Thursday 3rd October at 09:00 CET. We expect to be finished by the end of the day, possibly earlier.

During the maintenance, we have to unmount /cluster on all virtual machines (VMs) that mount it. This means that the /cluster/projects/pXX areas will be unavailable on the VMs, and it will not be possible to use the module load system for software on the VMs. Some VMs might also require a reboot.
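To see whether /cluster is currently mounted on a Linux VM, the mount table can be inspected. This is a minimal sketch under assumptions: the helper name `is_mounted` is hypothetical, the sample line uses a made-up server name, and it assumes the standard Linux `/proc/mounts` layout (device, mount point, file system type, options).

```python
def is_mounted(mount_point: str, mounts_text: str) -> bool:
    """Return True if `mount_point` appears as a mount point in
    `mounts_text`, given in /proc/mounts format."""
    target = mount_point.rstrip("/") or "/"
    for line in mounts_text.splitlines():
        fields = line.split()
        # The second field of each entry is the mount point.
        if len(fields) >= 2 and fields[1] == target:
            return True
    return False


# Hypothetical sample; "nfs-server" is an invented host name.
SAMPLE = (
    "proc /proc proc rw,nosuid 0 0\n"
    "nfs-server:/cluster /cluster nfs4 rw,relatime 0 0\n"
)

if __name__ == "__main__":
    with open("/proc/mounts") as f:
        print("/cluster mounted:", is_mounted("/cluster", f.read()))
```

An entry in the mount table only shows that the mount exists, not that it responds, so a mounted /cluster can still be hanging.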

Jobs on Colossus will continue to run as normal, but it will not be possible to submit new jobs during the stop.

Do not run jobs on VMs that need data from /cluster or software modules. If you do so, we will have to kill them to unmount the /cluster area. Also, if the VM needs to be rebooted, all ru...