Active Incident

Updated 11 days ago

This is the official System Status page for M5 Hosting and M5 Cloud. Our Standard Maintenance Window is 00:01 to 03:00 PDT (US/Pacific Time) every Sunday morning. Any change or maintenance perceived to carry more than very minimal risk to service will be performed during this window and announced here and to subscribers to this Status Page.

Incident Status

Operational

Components

Cloud Hypervisors

Locations

Cloud Zone "CA1" - (SDTC Data Center)



June 30, 2016 2:12PM PDT
June 30, 2016 9:12PM UTC
[Investigating] At 13:50 Pacific Time on 6/30, M5 Cloud experienced a cascading failure of all hypervisors in zone CA1. We are actively bringing hypervisors back online, but all VMs will have experienced some downtime. We have already recovered sufficient capacity for all VMs to gradually come back online. We will open a ticket with you individually if your VM experiences a prolonged outage.

June 30, 2016 2:27PM PDT
June 30, 2016 9:27PM UTC
[Investigating] M5 Cloud is now operating at severely reduced capacity in zone CA1. There is insufficient capacity for all VMs in that zone. This disruption could continue for some time while we investigate and work to restore service. We will post another update in a half-hour.

June 30, 2016 3:09PM PDT
June 30, 2016 10:09PM UTC
[Identified] Zone CA1 of M5 Cloud is 100% offline. We have identified the problem and are working on an interim solution to get capacity back online as quickly as possible. This involves migrating infrastructure to a different network fabric, and could take one or more hours to complete. Please bear with us and watch our status.io page for continued updates.

June 30, 2016 3:46PM PDT
June 30, 2016 10:46PM UTC
[Monitoring] The CA1 zone has tentatively stabilized and VMs are gradually being restarted. We will be manually restarting any VMs not automatically restarted by the system. Please stand by for further updates.

June 30, 2016 4:26PM PDT
June 30, 2016 11:26PM UTC
[Monitoring] At this point all VMs in zone CA1 should be restarted if they were online when this event began. The underlying issue has not yet been resolved, but we are continuing to monitor the situation closely and are working toward a permanent solution.

July 1, 2016 12:11PM PDT
July 1, 2016 7:11PM UTC
[Monitoring] Root cause analysis of the June 30th CA1 zone-wide outage points to interoperability problems between XenServer 6.5 and the hypervisors' current network interface cards (NICs). To stabilize the CA1 zone, we will swap the NICs for a different model and apply XenServer 6.5 patches during the process. This maintenance is not expected to be service-impacting, but we recommend against performing any system modifications during this time in case of unexpected issues.

July 2, 2016 2:27AM PDT
July 2, 2016 9:27AM UTC
[Investigating] We are experiencing issues with one of the hypervisors while it applies the patches. Because it is in a hung state, we are rebooting it to rescue the associated resources. Updates to follow shortly.

July 2, 2016 2:50AM PDT
July 2, 2016 9:50AM UTC
[Investigating] A second host is experiencing problems with patching, which has expanded the scope of this outage. At this time, ~50% of CA1 virtual infrastructure is experiencing an outage.

July 2, 2016 4:21AM PDT
July 2, 2016 11:21AM UTC
[Investigating] The event has now cascaded across all hypervisors in the cluster. We are classifying this as a zone-wide outage in CA1.

July 2, 2016 5:23AM PDT
July 2, 2016 12:23PM UTC
[Investigating] At this time the outage is affecting the entire CA1 zone. We are escalating these issues to our relevant support vendors to try to determine the root cause.

July 2, 2016 11:16AM PDT
July 2, 2016 6:16PM UTC
[Investigating] UPDATE -- We are continuing to actively work with vendor support via GoToMeeting to get everyone's VMs back online. No ETA yet, as we are still troubleshooting the root cause. Thank you for your patience. Our current plan is to fully patch one additional HV in addition to the pool master, and bring up all the CA1-zone VMs on this 2-hypervisor cluster. This should minimize downtime, as we can patch the remaining hypervisors and add them to the pool later.

July 2, 2016 11:18AM PDT
July 2, 2016 6:18PM UTC
[Investigating] UPDATE -- We are continuing to make progress. Working with our vendor support, we have successfully brought the hypervisor pool master back online and have patched XenServer to the latest update version. Our plan now is to fully patch one additional HV in addition to the pool master, and bring up all the CA1-zone VMs on this 2-hypervisor cluster. This should minimize downtime, as we can patch the remaining hypervisors and add them to the pool later.

July 2, 2016 1:03PM PDT
July 2, 2016 8:03PM UTC
[Investigating] UPDATE - The CA1 zone is back online. Given the extended outage and related break-fix work, the CloudStack orchestration platform lost the state associated with running infrastructure. We are currently manually starting customer virtual machines based on state information collected prior to the maintenance effort.

July 2, 2016 3:30PM PDT
July 2, 2016 10:30PM UTC
[Monitoring] UPDATE - At this time maintenance activities are complete and CA1 is fully operational. All customers have been sent tickets individually identifying affected infrastructure. If you are having any issues at this time please follow up using the associated ticket thread.

July 3, 2016 12:24AM PDT
July 3, 2016 7:24AM UTC
[Investigating] UPDATE - At this time there is a large-scale outage in the CA1 zone. Some customer virtual infrastructure is down, while other infrastructure is running with read-only disks. We have identified the root cause as performance degradation in the CA1 storage array and are actively working with the vendor on a resolution path. Please note there is no risk of data loss. At this time we are estimating an outage window of a few hours. In the meantime, if you need infrastructure, we recommend deploying in our CA2 zone, as all systems there are healthy. We do note that, unfortunately, data migration from the CA1 zone to the CA2 zone is not currently possible. System status updates to follow.

July 3, 2016 2:52AM PDT
July 3, 2016 9:52AM UTC
[Identified] We believe we have found a solution to the trouble and are working to restore all services & customer instances at this time.

July 3, 2016 3:58AM PDT
July 3, 2016 10:58AM UTC
[Monitoring] We have restored all customer VMs at this time and are monitoring closely. We believe our storage vendor has finally found the root cause of the issue we have been experiencing, so we are confident that this zone should not continue to experience trouble. We will follow up with details and a root-cause analysis at a later time.

July 3, 2016 5:21AM PDT
July 3, 2016 12:21PM UTC
[Investigating] While we thought we had contained the issue, we have been alerted that another of our HVs, along with its hosted VMs, has disconnected from the cluster. We are currently working to get everything back online as soon as possible.

July 3, 2016 6:26AM PDT
July 3, 2016 1:26PM UTC
[Monitoring] UPDATE - All hosts are online and the system is fully functioning. We are continuing to monitor.

July 3, 2016 10:45AM PDT
July 3, 2016 5:45PM UTC
[Monitoring] UPDATE - Zone CA1 is stable and functioning. We are continuing to monitor.

July 3, 2016 8:10PM PDT
July 4, 2016 3:10AM UTC
[Investigating] We received an alert that another hypervisor rebooted without warning. Some VMs on this HV rebooted, and some did not automatically reboot when the HV came back up. We are currently investigating.

July 3, 2016 10:54PM PDT
July 4, 2016 5:54AM UTC
[Monitoring] The last update was technically correct, but because it was sent in the context of the broader previous outage in CA1, we'd like to offer this clarification: a fault in one hypervisor host caused the VMs on that one host to reboot onto other hosts in the same cluster. This is by design, and the system operated as designed, in a manner that provides the best uptime possible in the event of a hardware fault. For the affected customers, the recovery was very fast. However, not all reboots are always perfect. In one customer case, a single VM didn't boot up automatically because a recently used virtual CD image was still in the virtual CD drive. This delayed recovery of that one VM by about 30 minutes. We currently have 300% more capacity than required in CA1 to run customer VMs, and redundancy features are all functional at this time. We appreciate the sensitivity of the multiple recent events. It is with that in mind that we respectfully offer the above clarification.
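
For the technically curious, the failover pattern described above can be sketched in a few lines of Python. This is an illustration only, not M5's actual orchestration code; the host names, VM names, and per-host capacity are all hypothetical.

    # Simplified model of the HA behavior described above: when a host fails,
    # its VMs are restarted on the surviving hosts in the same cluster.
    # Illustration only; all names and capacity numbers are hypothetical.

    def restart_vms_after_host_failure(hosts, failed_host, slots_per_host=10):
        """Reassign the failed host's VMs to surviving hosts with free capacity."""
        displaced = hosts.pop(failed_host)  # VMs that were on the failed host
        for vm in displaced:
            # Pick the least-loaded surviving host that still has room.
            target = min(hosts, key=lambda h: len(hosts[h]))
            if len(hosts[target]) >= slots_per_host:
                raise RuntimeError(f"no capacity left to restart {vm}")
            hosts[target].append(vm)
        return hosts

    # Example: three hosts; hv1 fails and its VMs restart on hv2 and hv3.
    cluster = {"hv1": ["vm-a", "vm-b"], "hv2": ["vm-c"], "hv3": ["vm-d"]}
    print(restart_vms_after_host_failure(cluster, "hv1"))
    # -> {'hv2': ['vm-c', 'vm-a'], 'hv3': ['vm-d', 'vm-b']}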

July 5, 2016 2:00PM PDT
July 5, 2016 9:00PM UTC
[Investigating] Recent outages in the CA1 Cloud Availability Zone have been caused by slow completion of commands in the cloud orchestration layer. This included the commands that move VMs from one hypervisor host to another when there are failures or scheduled maintenance on a hypervisor host. When a host went down, the commands were queued up and ultimately timed out. This caused migrations to fail and VMs to stop. This issue was resolved at approximately 02:30 PDT on July 3rd, and commands are now completing quickly and reliably.

While the slow command issue is resolved, the initial issue of hypervisor hosts rebooting persists. Until this is resolved, you may experience reboots of your VM. It should quickly come back up on another hypervisor host, as it is designed to do. We continue to troubleshoot the anomalous reboots of the hypervisor hosts. On average, each host reboots about 1-2 times per day. There are many more hosts than are required to run all VMs. Our own very experienced engineering team is working with hardware and software vendors to resolve the remaining issues as quickly and completely as possible.

If the anomalous reboots are impacting the services you provide to your users and you need an immediate resolution, our CA2 Availability Zone is available and is not affected by the rebooting issues. We can help you snapshot and move your VMs to CA2. Please let us know if we can help. A migration to CA2 will require a different IP address, and you will need to change your DNS records to reflect that. Please let us know how we can help you via email or the customer portal at https://service.m5hosting.com. Thank you for your patience and your trust.
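
To make that timeout mechanism concrete, the short Python sketch below models how commands queued behind one slow command can exceed their timeout before they ever run. It is a hypothetical model, not the actual orchestration layer; the command names and the 60-second timeout are made up.

    # Hypothetical model of the failure mode described above: orchestration
    # commands queue behind a slow one, exceed their timeout while waiting,
    # and the corresponding migrations fail.

    COMMAND_TIMEOUT = 60  # seconds a queued command may wait (made-up value)

    def drain_queue(commands):
        """commands: list of (name, run_seconds). Returns each command's fate."""
        clock = 0
        results = {}
        for name, run_seconds in commands:
            if clock > COMMAND_TIMEOUT:
                # The command expired while waiting behind slow predecessors.
                results[name] = "timed out -> migration fails, VM stops"
            else:
                clock += run_seconds  # the command runs to completion
                results[name] = "ok"
        return results

    # One pathologically slow command starves everything queued behind it.
    queue = [("migrate vm-1", 300), ("migrate vm-2", 5), ("migrate vm-3", 5)]
    for name, fate in drain_queue(queue).items():
        print(f"{name}: {fate}")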

July 7, 2016 2:02AM PDT
July 7, 2016 9:02AM UTC
[Investigating] We have debugged, reconfigured, and worked with paid vendor support on every system integrated into the CA1 Availability Zone in our cloud service. We have found bugs and made fixes. However, the anomalous failures of the hosts persist. On Tuesday night we implemented a bug fix that we were certain would be the resolution; however, shortly after 5:00am PDT Wednesday, a host crashed, and a few others followed suit throughout the day.

The periodic crashes first began when we upgraded the hypervisor software in CA1 from Citrix XenServer 6.2 to 6.5. Those crashes revealed other bugs in both hardware and other software. Those other bugs are resolved, but the crashes continue. We run XenServer 6.5 successfully in other parts of our infrastructure without crashes, on systems where the only difference is the speed and number of cores in the CPU. XenServer 6.5 is not brand new, so bug fixes and a service pack are available. But, for reasons we have not yet been able to determine, things are not stable with XenServer 6.5 on the same cluster hardware that has been very stable running your VMs for 2 years.

We believe the best option at this point is to swap out the host hardware entirely and build a new cluster. We have built a new cluster (named "c2") from hardware that has been known good and healthy while running XenServer 6.5 elsewhere in our infrastructure. We have added this cluster to the CA1 zone and have put some VMs on it. We should know soon whether it suffers the same issues as the original cluster. As a backup plan, we are also building yet another cluster (c3) with newer, different-model hardware. This will be used if the second cluster (c2) is not stable. If the second cluster (c2) is stable, your VM will be migrated to it. It will still be in the CA1 Availability Zone (physically different data center, network, storage, and power, located 13 miles away from CA2). This work is already under way.

We will update http://status.m5hosting.com as events require and to assure you that we are not resting until this is resolved. If you want to share your thoughts, or have questions or suggestions, please let us know. That's not a platitude. We want to hear from you. Please call or email at any time. We sincerely appreciate your patience.

July 7, 2016 11:47PM PDT
July 8, 2016 6:47AM UTC
[Investigating] It's been 30 hours and no reboots! As planned and indicated in our last update, we have built a new cluster on XenServer 6.5. We have been burning in and stress-testing this new cluster for 12 hours so far. A third cluster will be ready shortly (also on 6.5). We will soon be ready to live-migrate customers to the new clusters. In most cases, this will not require any effort or changes on your part, nor cause any downtime. This work is already under way.

We will update http://status.m5hosting.com as events require and to assure you that we are not resting until this is fully resolved. If you want to share your thoughts, or have questions or suggestions, please let us know. Please call or email at any time. We sincerely appreciate your patience.

July 9, 2016 2:56AM PDT
July 9, 2016 9:56AM UTC
[Monitoring] It has been more than 54 hours since the last unplanned reboot of a cloud hypervisor host in CA1. Nonetheless, we'd like to eliminate the flaky systems from production entirely. We have built and tested two additional clusters (c2 and c3). Each of the new clusters currently has 200% of the capacity required to run all customer VMs in the CA1 zone. This means that even if we encounter a problem like this in the future, we would be able to isolate it to a portion of an Availability Zone ("AZ") rather than have it affect the whole AZ. This is an important architectural change. We will also add a cluster to CA2, and we will build future AZs with multiple clusters.

As we scale up each AZ, we will add moderately sized clusters. There may not always be 4x as much capacity as needed in each AZ as there is now, or even 2x, but the ability to isolate cluster-wide problems to a fraction of a zone will improve our ability to mitigate the impact if such a challenge happens again. Each cluster will continue to be at least "N+1" hosts: able to lose one host and still have capacity to run all VMs assigned to it.

The experience of crawling through every detail of the infrastructure in CA1 did uncover some bugs and smaller issues that needed to be corrected. Many of those have been corrected, and others are on a new to-do list. Right now, so many things have been improved that we really are excited about the system again.

If you have VMs running in CA1, we will be in touch with you within the next week about the migration of your VMs to one or both of the new clusters. There really isn't anything required of you. The migration can be done without a reboot or any kind of interruption. You can keep the same IP addresses, and no reconfiguration is needed at all. However, we consider it "maintenance", so we'd like to make sure you know when it's happening, or we can schedule it for a time that is of lowest risk or lowest load for you, just for extra safety. We can also answer any questions and address any concerns you may have. With all 3 clusters in CA1 stable at this time, there isn't any urgency to move right away. However, we will retire the original cluster shortly, and that will require that you be migrated off of c1 and onto c2 and/or c3.

Thank you very much for your patience through these issues. We'd appreciate hearing from you about this message, the changes and work described, any questions or comments, or even just to say "Hi!".

Sincerely, Mike
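
For the curious, the "N+1" sizing mentioned above boils down to simple arithmetic, sketched below in Python. The host count and per-host VM slots are hypothetical numbers, not the actual dimensions of c1, c2, or c3.

    # "N+1" check: a cluster should be able to lose one host and still have
    # capacity to run every VM assigned to it. Numbers below are hypothetical.

    def is_n_plus_1(host_count, slots_per_host, vms_assigned):
        """True if all assigned VMs still fit with any single host down."""
        surviving_capacity = (host_count - 1) * slots_per_host
        return vms_assigned <= surviving_capacity

    # e.g. 5 hosts of 20 VM slots each: losing one leaves 4 * 20 = 80 slots,
    # so up to 80 assigned VMs keeps the cluster N+1.
    print(is_n_plus_1(host_count=5, slots_per_host=20, vms_assigned=80))  # True
    print(is_n_plus_1(host_count=5, slots_per_host=20, vms_assigned=90))  # False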

M5 Hosting Load Balancing Cluster
Operational

M5 Spam Filter Cluster
Operational

Access Network System (one or more racks)
Operational

Core Network Systems
Operational

Electrical Systems
Operational

Environmental Controls
Operational

M5 Support Portal (service.m5hosting.com)
Operational

M5 Hosting Legacy Cloud
Operational

M5 Cloud Manager Web Interface (manage.m5cloud.com)
Operational

Cloud Hypervisors
Operational

Cloud Primary Storage
Operational

Cloud Secondary Storage
Operational

M5 Cloud Website
Operational

M5 Hosting Website
Operational

External Services

HipChat

Mandrill

SendGrid
