
TSD Operational Log - Page 18

Published Jan. 27, 2016 9:08 PM

Dear TSD users

We are sorry to announce that a network issue in TSD is preventing Windows computers from seeing storage and ThinLinc from working.

We are on the case; updates will follow.

Best

Gard@TSD

Published Dec. 18, 2015 10:46 AM

Hi,

The NFS problem that was causing open sessions in TSD to freeze has been identified and possibly solved. The TSD service should be stable and open for production again today. More news soon.

Sorry for the inconvenience

Regards,

TSD@USIT

Published Dec. 17, 2015 5:49 PM

Dear TSD-user

The problem that occurred today was solved by replacing the failing switches with two new ones. Almost all components have been moved to the new switches. The move will continue tomorrow, as some reconfiguration of the internal network may be needed. At the moment the Windows machines should be up and running, and users can log into their VMs via PCoIP and ssh+RDP. However, some instability can be expected, since we are still reconfiguring the internal network. Linux VMs might still hang, and we are working to resolve this problem too.
Please follow the updates in the operational log on the TSD webpage.

Sorry for the inconvenience

TSD@USIT

Published Dec. 17, 2015 10:14 AM

Dear TSD-users,

Unfortunately we are having a network failure, which has caused severe instability since yesterday, 16/12, at 18:00. We need to switch to new routers, which requires an unscheduled downtime now. We do not know at the moment how long the outage will last, but we are working very hard to solve the problem as soon as possible during the day.

Sorry for the inconvenience,

Regards,
Francesca

Published Dec. 17, 2015 8:18 AM

Hi

We still see glitches in the network, caused either by failing infrastructure or by unstable power.

We are working on this. Sorry for the inconvenience.

TSD@USIT

Published Dec. 16, 2015 10:16 PM

Hi

We have had a power or network failure that caused an unplanned reboot of many components in TSD, including several project VMs. The situation has stabilized and the machines are now running normally. We are investigating the causes of this major failure.

We apologize for the inconvenience.

TSD@USIT

Published Dec. 16, 2015 9:05 PM

Hi

It seems that we have had a power or network failure, inside or outside TSD, which caused 100+ machines to reboot and some of the services to stall. We are working on it.

TSD@USIT

Published Dec. 14, 2015 9:52 AM

The TSD VMware service is up and running now.

Sorry for the inconvenience.

TSD@USIT

Published Dec. 14, 2015 9:06 AM

Hi

A patch or some unknown factor took down our TSD View server yesterday. We are working full time on fixing it and hope to have the service back very soon.

Best

TSD@USIT

Published Nov. 30, 2015 3:16 PM

Dear TSD-user,

The update of the VMware security server has been successfully completed. All Windows VMs are once again accessible via the PCoIP protocol.

Enjoy TSD!

Regards,

Francesca

Published Nov. 30, 2015 11:11 AM

Dear TSD-user,
Today, 30/11-2015, between 13:00 and 15:00 CET, we will upgrade the VMware View security server. During the upgrade, login to the Windows machines in TSD via the PCoIP protocol will not be available. Login to the Windows servers will therefore only be possible via an ssh+RDP connection (http://www.uio.no/tjenester/it/forskning/sensitiv/hjelp/brukermanual/ssh-og-rdp/index.html). However, be aware that the ssh+RDP connection will only work if you "Log off" from your last session opened with PCoIP.
The Windows and Linux VMs will not be affected by the upgrade, and processes running on the machines will keep running. Jobs on Colossus will not be affected.

Regards,
Francesca

Published Nov. 20, 2015 2:34 PM

Dear Colossus User

The maintenance has been successfully completed and the cluster is up and running. The hugemem nodes still need to come up and will probably not be available until Monday next week. However, all jobs that were queued during the downtime are already running.

Happy computing!

Francesca

Published Nov. 18, 2015 10:22 PM

On 19/11, from 8:00 am, Colossus will be stopped for maintenance. The outage is expected to last two days.

Francesca

Published Nov. 5, 2015 4:24 PM

Dear TSD-user,

The maintenance stop of Colossus was successfully completed and the cluster is back in production.
As previously announced, there will be one more downtime on 19 Nov 2015 from 8:00 am. The downtime will last at most two days. This second downtime is needed to complete the work started now, namely setting up a new configuration that will significantly improve I/O in the cluster.

Please note that if you schedule a job with a running time longer than 14 days, the job will not start before the end of the next downtime.
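
The 14-day rule above follows from simple date arithmetic. As a rough sketch, assuming the scheduler only starts a job if its requested running time ends before the downtime begins (the exact scheduler behaviour on Colossus is not described here), the check could look like this in Python. The function name and the constant are illustrative and not part of any TSD or Colossus tool.

```python
from datetime import datetime, timedelta

# Assumption: start of the next downtime, taken from the announcement above.
NEXT_DOWNTIME_START = datetime(2015, 11, 19, 8, 0)


def can_start_before_downtime(now: datetime, wall_time: timedelta) -> bool:
    """Return True if a job started at `now` would end before the downtime begins."""
    return now + wall_time <= NEXT_DOWNTIME_START


if __name__ == "__main__":
    # Hypothetical submission time: the publication time of this post.
    submitted = datetime(2015, 11, 5, 16, 24)
    for days in (7, 13, 14):
        fits = can_start_before_downtime(submitted, timedelta(days=days))
        print(f"{days} days requested: can start now = {fits}")
```

Under these assumptions, a job submitted at the time of this post with a requested running time of 13 days would still fit before the downtime, while a 14-day job would have to wait until the downtime is over.
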
Happy computing!

Francesca@TSD

Published Nov. 4, 2015 12:27 AM

Dear TSD-users,

As announced two weeks ago, Colossus will be stopped today (4/11-2015) from 8:00 a.m. Barring setbacks, we expect to finish the downtime by Thursday afternoon. You will be notified when the service is back up.

Regards,

Francesca@TSD

Published Nov. 4, 2015 12:22 AM

Dear TSD-users,

There will be a maintenance stop of the TSD infrastructure on Thursday 5/11 from 15:00 to 15:30 CET. During the downtime, users will not be able to access TSD. The VMs will probably be rebooted at the end of the downtime, so all running processes will be stopped. The TSD downtime coincides with the Colossus maintenance stop, so there will be no jobs running on the cluster at the time of the downtime. The short notice is because we have decided to merge two maintenance stops, namely HNAS and Colossus, to minimise the number of outages.

The downtime lasted from 15:00 to 15:10, and everything is back up and working. Performance should be better.

Sorry for the inconvenience.
Regards,
Francesca

Published Oct. 20, 2015 10:02 PM

Dear TSD-user,

Tomorrow there will be an upgrade of the Cerebrum instance in TSD. The outage will last for the entire day. As a consequence of the maintenance stop, brukerinfo will not work.
You will receive an email when the maintenance is finished.
Sorry for the inconvenience.

Regards,

TSD team

Published Sep. 16, 2015 3:58 PM

Dear TSD-users,

The missing communication between Colossus and the Domain Controllers has been fixed, and Colossus is back in production as usual.
We expect that very few jobs, if any, failed during the unplanned outage.

Sorry for the inconvenience.

Happy computing!

Francesca@TSD

Published Sep. 16, 2015 1:50 PM

Dear TSD user

We have a problem with Colossus because our Domain Controller update caused an unexpected situation. We hope to get things back on track today; we'll keep you posted.

For those of you who pay for CPU hours and have had jobs killed, please email us at tsd-drift@usit.uio.no to get this refunded with interest.

Sorry for the inconvenience.

Gard

Published June 8, 2015 10:38 AM

Dear TSD user,

On 10 June 2015 at 12:00 CEST there will be an upgrade of the TSD disk. We expect the upgrade to last for an hour. During this period the system might hang for about one minute every 30 minutes (the first time at 12:00, the second time at 12:30, etc.).

For Colossus users: all jobs on Colossus nodes will keep running as usual during the upgrade. Note, however, that jobs finishing during the upgrade might crash because writing data back to the VMs will fail. We therefore advise you to schedule your jobs (when possible) so that they finish well after the upgrade period.

Regards,

TSD@USIT

Published May 18, 2015 1:34 PM

Dear TSD-user,

The Linux VMs will now be shut down for maintenance, as announced previously.

You will be informed when we have finished.

Regards,

Francesca@TSD

Published May 18, 2015 12:46 PM

Dear User,

The login problem we experienced this morning has just been solved.

Regards,

TSD@USIT

Published May 18, 2015 10:55 AM

Dear Users

We have an issue with the two-factor login. The problem occurs the second time you try to log in today. We are on the case and hope to have it solved quite soon.

Best

Gard@TSD

Published May 11, 2015 10:16 PM

Dear TSD user,

The maximum wall-time limit for jobs running on Colossus has now been increased to 28 days. This will facilitate the execution of long simulations/calculations. However, we strongly advise you not to run jobs for more than 7 days unless you have enabled checkpointing (...

Published May 11, 2015 10:02 AM

Dear TSD user,

Due to maintenance, all the Linux machines in TSD will be shut down on 18 May 2015 at 13:30 CEST. We expect the downtime to last for an hour. You will receive a notification by mail when the maintenance stop is over.

For Colossus users: all jobs on Colossus nodes will keep running as usual during the downtime. Note, however, that jobs finishing during the downtime might crash because writing data back to the VMs will fail. We therefore advise you to schedule your jobs (when possible) so that they finish well after the downtime period.

Regards,

TSD@USIT