TSD Operational Log - Page 11
There will be a scheduled minor upgrade of PostgreSQL on 25 June from 08:00 to 09:30.
During this downtime, applications that use PostgreSQL will not work, as we will restart the database in your project. Other services inside TSD will continue working as normal.
Dear TSD users,
selfservice.tsd.usit.no is currently unavailable. We are working on getting it back up again as quickly as possible.
--
Best regards,
TSD
We are experiencing issues with some services, which may lead to some users being unable to log in to TSD. We are investigating the cause and working on a fix.
The DRAGEN node is now accessible on colossus and can take slurm workloads. Please read the updated docs:
/english/services/it/research/sensitive-data/use-tsd/hpc/dragen.html
Abdulrahman @ TSD
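As a sketch, submitting a workload to the DRAGEN node could look like the job script below. The partition name "dragen" and the resource requests are assumptions for illustration only; check the documentation linked above for the actual values.

```shell
# Hypothetical Slurm job script for the DRAGEN node; the partition
# name "dragen" and the resource numbers are assumptions, not taken
# from the TSD docs.
cat > dragen_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=dragen-test
#SBATCH --partition=dragen
#SBATCH --time=01:00:00
#SBATCH --mem=8G
# Place your DRAGEN command here.
EOF
echo "wrote dragen_job.sh"
# On a submit host you would then run: sbatch dragen_job.sh
```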
We are doing maintenance on TSD login from 16:00 until 18:00 today. During the maintenance new login sessions will not be possible, but active sessions will continue working.
The Colossus file system is having issues at the moment, making the cluster unusable. We are working on fixing it.
Update: The file system is up again. The problems started around 16:15 today, and lasted until 21:00. During that time, it is likely that jobs on Colossus have crashed, so check your results. It is also likely that the problems have caused nfs hangs on the Linux VMs that mount /cluster.
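A hung NFS mount typically makes any file operation on it block indefinitely, so one quick way to probe a VM is to run stat against the mount with a deadline. This is only a sketch; /tmp stands in for /cluster so the snippet runs anywhere.

```shell
# Probe a mount point with a deadline: a healthy mount answers stat
# quickly, while a hung NFS mount makes it block until timeout kills it.
MOUNT=/tmp   # on a TSD VM you would point this at /cluster
if timeout 5 stat "$MOUNT" >/dev/null 2>&1; then
    RESULT=responsive
else
    RESULT=hung
fi
echo "mount $MOUNT looks $RESULT"
```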
Dear TSD user,
We are having some issues with web-based file export and are working to fix the problem. Import is working as expected.
We are experiencing issues with parts of Colossus, which may affect submit hosts. The underlying problem has been fixed, but the symptoms of unresponsive hosts may persist and require a reboot of the VM. Please send a support case if you experience this problem and we will work to resolve it asap.
We are experiencing issues with some services, which affect users who use /cluster on their VMs in TSD. We are investigating the cause and working on a fix.
Dear TSD users,
The proxy for the API-services has stopped working. We're looking into it.
--
The TSD team
Dear TSD users!
Colossus is currently running a little slower than usual. We are not yet certain why, but we are looking into it.
The most likely reason is an overload on the file systems, due to two things:
- We added a new storage server (96 TiB) and the system is rebalancing.
- The rebalancing causes the backup system to detect a LOT of changes, which puts a lot of load on the storage system to back everything up.
There is no way around this except downtime and manual balancing, which nobody wants.
Thanks for your continued patience.
--
Best regards,
The TSD team
We are experiencing issues with some services, which may lead to some users being unable to log in to TSD and receiving a 504 Gateway Timeout error. We are investigating the cause and working on a fix.
https://data.tsd.usit.no is currently not allowing logins. Apart from this, services should be running as expected, and we are working on fixing the issue as quickly as possible.
--
Best regards,
TSD
Dear TSD users,
Unfortunately, the machine exporting the network file system to the submit hosts crashed again, causing those machines to hang.
As a result, we will have to restart the hanging machines so that you can log in to them again.
We're working towards getting the issue fixed, and apologize for the frequent crashes around the /cluster mounts lately. Finding a permanent solution to this issue is currently at the top of our priorities.
--
Our sincere apologies,
The TSD team
There was a temporary outage of some TSD services between 15:13 and 15:20 today, due to storage issues.
--
Our sincere apologies,
The TSD team
Dear TSD users,
Unfortunately, the machine exporting the network file system to the submit hosts crashed, causing those machines to hang.
As a result, we will have to restart the hanging machines so that you can log in to them again.
We're working towards getting the issue fixed, and apologize for the frequent crashes around the /cluster mounts lately. Finding a permanent solution to this issue is currently at the top of our priorities.
--
Our sincere apologies,
The TSD team
The machine doing the NFS-export of the /cluster/ filesystem crashed Saturday night. This will cause some login-machines to freeze, and make /cluster and modules unavailable to submit-hosts.
We're looking into it.
--
Best regards,
TSD
A network error prevented jobs from starting on Colossus over the last day. The error has been fixed, and jobs are starting as they should again. The error may have affected running jobs, so you might want to check the status of your jobs.
--
The TSD Team
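After an incident like this, Slurm's accounting tool can show how recent jobs ended, so crashed jobs stand out. A hedged sketch (sacct is Slurm's accounting client, available on submit hosts; the guard keeps the snippet harmless on machines without it):

```shell
# List the state and exit code of your jobs from the last day, so
# failed jobs (FAILED, NODE_FAIL) stand out. Guarded in case sacct
# is not installed on this host.
if command -v sacct >/dev/null 2>&1; then
    CHECK=$(sacct --starttime=now-1day --format=JobID,JobName%20,State,ExitCode)
else
    CHECK="sacct not found: run this on a submit host"
fi
echo "$CHECK"
```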
There was a short-lived issue with Windows login, which will have interrupted active login sessions. This has now been resolved; apologies for the disturbance.
ThinLinc is currently not functioning properly.
You might be able to log in and get a session, but it will quickly freeze and stop working. We're aware of the problem and working on fixing it as quickly as possible.
The Windows platform in TSD is still working, and as a workaround, please use this link to log in to TSD.
Our apologies for the inconvenience.
--
Best regards,
The TSD Team
2019-05-08
- A bug in qsumm that resulted in wrong numbers being shown for running or pending jobs has been fixed.
2019-04-24
- A bug in cost that made it show incorrect numbers has been fixed. /cluster/projects/pNN/cluster_disk_usage.txt is now updated nightly again.
2019-04-11
- At least one of the old hugemem nodes is now in production on Colossus 3.
2019-04-10
- MPI jobs with OpenMPI should work properly now. The preferred way to start MPI programs is with srun. The documentation has been updated to reflect this.
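A minimal sketch of such a job script, with srun as the launcher; the module name and program are placeholders, not taken from the TSD documentation.

```shell
# Hypothetical MPI job script using srun as the launcher; the module
# and binary names are placeholders.
cat > mpi_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=mpi-test
#SBATCH --ntasks=4
#SBATCH --time=00:10:00
#SBATCH --mem-per-cpu=1G
module load OpenMPI          # placeholder module name
srun ./my_mpi_program        # srun starts one rank per task
EOF
echo "wrote mpi_job.sh"
```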
2019-04-09
- The cost command has now been fixed.
- "srun" in job scripts produced errors like /var/spool/slurmd/job01122/slurm_script: line 21:...
*UPDATE*
/cluster should now be available on the new p<NUM>-submit.tsd.usit.no hosts. You can reach these by SSH from your login hosts; from Windows you can access them using PuTTY.
Modules are also available on these hosts, so that you can test your pipelines with the new software.
*END UPDATE*
Dear TSD users,
Unfortunately, something has happened with the mounts of /cluster on the new RHEL7 submit hosts created for the new Colossus cluster. We're working on it and will update this notice as soon as it is fixed.
Our apologies for the inconvenience.
--
Best regards,
The TSD team
The new cluster, Colossus 3, is now up, and the old cluster is turned off.
We are experiencing issues with some services, which may lead to some users being unable to log in to TSD. We are investigating the cause and working on a fix.