
On 20 October 2025 a major disruption in AWS’s US-EAST-1 region knocked large parts of the internet offline or into degraded service. The incident, which AWS traced to DNS resolution problems affecting DynamoDB endpoints that then cascaded through EC2 subsystems, left household apps, payment services and enterprise platforms contending with failures and a prolonged recovery.
For organisations that rely heavily on a single cloud region, the episode was a stark reminder: cloud convenience does not eliminate operational risk.
At Boston Limited we use events like this to highlight an important truth: resilience is an architectural and operational requirement, not a checkbox. Below we outline the practical steps customers should take, and how Boston can help deliver them quickly and pragmatically.
Many of the failures seen in this outage were amplified because critical services, DNS endpoints or certificates were effectively region-bound. True resilience requires deploying stateful services across multiple regions, with careful design to avoid “regional single points of failure”: for example, by distributing databases, caches and load-balancing endpoints across regions or cloud providers. Boston helps customers design and implement multi-region and multi-cloud topologies, and we provide the hardware and integration services needed for hybrid setups that combine on-prem capacity with cloud fail-over.
Not every workload needs to be active in two public-cloud regions. For many businesses, a hybrid model (combining cloud with on-premises systems) is the most cost-effective and reliable approach. Boston supplies and supports validated on-prem appliances and private-cloud stacks that can be used as warm standby or active fail-over targets, plus orchestration services to switch traffic when a cloud region is impaired. This means mission-critical functions can continue even if a single public-cloud region is disrupted.
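As a rough illustration of the fail-over decision described above, the following Python sketch picks the most-preferred healthy target from a priority-ordered list that mixes cloud regions with an on-prem standby. The `Target` type, the target names, the example endpoints and the `is_healthy` callback are all hypothetical placeholders, not part of any Boston or AWS product; a real deployment would plug in its own health checks and traffic-switching mechanism.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Target:
    name: str       # hypothetical label, e.g. "cloud-primary"
    endpoint: str   # hypothetical address of the service in that location
    priority: int   # lower value means more preferred


def pick_target(
    targets: Iterable[Target],
    is_healthy: Callable[[Target], bool],
) -> Optional[Target]:
    """Return the most-preferred target that passes its health check.

    A health check that raises is treated the same as an unhealthy result,
    so a broken probe cannot block fail-over to the next target.
    """
    for target in sorted(targets, key=lambda t: t.priority):
        try:
            if is_healthy(target):
                return target
        except Exception:
            continue
    return None  # nothing healthy: the caller should alert or degrade gracefully


# Example topology (all names and endpoints are illustrative assumptions).
TARGETS = [
    Target("cloud-primary", "https://primary.example.com", priority=0),
    Target("cloud-secondary", "https://secondary.example.com", priority=1),
    Target("on-prem-standby", "https://standby.example.internal", priority=2),
]
```

With this shape, an impaired primary region simply fails its health check and traffic moves to the next entry in the list; keeping the on-prem standby as the last entry means it is only used when both cloud locations are unavailable.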
The outage underlined how a single fault, for example a DNS or DynamoDB problem, can cascade across seemingly unrelated services. Our architecture reviews focus on identifying brittle dependencies (hard-coded links to single endpoints, certificates stored only in one region, synchronous cross-region calls) and replacing them with resilient patterns: asynchronous queues, retry/backoff logic, local caches and circuit breakers. Boston’s engineers can run dependency-mapping and chaos-testing exercises so you discover weaknesses before they’re exposed by an incident.
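Two of the patterns named above, retry/backoff logic and circuit breakers, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than a production implementation: the class and function names, the thresholds and the timeouts are all hypothetical, and the breaker is not thread-safe.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are being rejected."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of piling more load onto a sick dependency.
                raise CircuitOpenError("dependency marked unhealthy")
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the breaker
            raise
        # Success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def retry_with_backoff(func, attempts=4, base_delay=0.2, max_delay=5.0, sleep=time.sleep):
    """Call func, retrying failures with capped exponential backoff and full jitter."""
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise  # the breaker already decided; retrying would defeat it
        except Exception:
            if attempt == attempts - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))  # jitter avoids synchronised retry storms
```

The two pieces are complementary: backoff smooths over brief glitches, while the breaker stops callers from hammering a dependency that is clearly down, which is the kind of retry amplification that can prolong recovery after a regional incident.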
Outages are reminders that systems need to be resilient by design. Boston supports customers through tailored hardware solutions, on-site services and expert guidance to ensure workloads are optimally configured, reducing the likelihood of disruption and enabling faster recovery when incidents occur.
Boston Training Academy (BTA) delivers advanced training programmes that combine NVIDIA’s industry-recognised curriculum with Boston’s real-world infrastructure expertise. Through courses such as NVIDIA AI Infrastructure Training and the AI Infrastructure & Operations Fundamentals course, participants gain an understanding of how AI workloads operate across cloud and multi-cloud environments. Boston’s consultants also provide strategic guidance on best practices for reliability, scalability and resilience, helping organisations explore approaches to disaster recovery and fail-over planning within their AI infrastructure strategies.
Cloud providers will continue to invest in reliability, but outages will still happen. The right question for any business isn’t whether it will be affected, but how quickly it can detect, contain and recover when a key provider fails. Boston helps organisations answer that question with technical design, proven products and operational support tailored to each customer’s risk profile.
If your team would like to review resilience for critical workloads, we can help, starting with a focused architecture review and an incident-readiness plan. Speak with our sales team to learn how Boston can propel your business to new heights.