Rollout of 2019.3 started for all customers.
07.11.2019 Outage in the West Europe region of our IaaS provider, Microsoft Azure:
Summary of Impact: Between 02:40 and 10:55 UTC on 07 Nov 2019, a subset of customers using Storage in West Europe experienced service availability issues. In addition, resources with dependencies on the impacted storage scale units may have experienced downstream impact in the form of availability issues, connection failures, or high latency.
Root Cause: Every Azure region has multiple storage scale units that serve customer traffic. We distribute and balance load across the different scale units and add new scale units as needed. The automated load-balancing operations occur in the background to ensure all the scale units are running at healthy utilization levels and are designed to have no impact on customer-facing operations. During this incident, we had just enabled three storage scale units to balance the load between them, to keep up with changing utilization on the scale units. A bug in this process resulted in backend roles crashing across the scale units participating in the load-balancing operations, causing them to become unhealthy. It also impacted services dependent on storage in the region.
Mitigation: Engineers mitigated the impact to all but one scale unit by deploying a platform hotfix. Mitigation to the remaining scale unit was delayed due to compatibility issues identified when applying the fix but has since been completed.
Next Steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):
- The fix has already been deployed to the impacted scale units in the region. We are currently deploying the fix globally to our fleet.
- We have performed cross-scale-unit load-balancing operations numerous times before without any adverse effect. In the wake of this incident, we are again reviewing our procedures, tooling and service for such load-balancing operations. We have paused further load-balancing actions in this region until this review is completed.
- We are rigorously reviewing all our background management processes and deployments to prevent any further impact to customers in this region.
- We are reviewing gaps in our validation procedures so that such issues are caught in our validation environment.