A recent outage at Cloudflare, a content delivery network used by many companies, took down multiple websites across the globe. The cause was a change to the network configuration in Cloudflare’s busiest locations, made as part of a long-running project to increase resilience there.
In a recent fireside chat with YourStory, Venkatesh Sundar, Founder and Chief Marketing Officer, Indusface, discussed ways of dealing with outages or failures and minimising their impact.
“It happens to anybody, even with the best in the business, independent of size or scale, since we are dealing with technology, and a complex set of components trying to ensure us with services. Thus, failures and outages are a part and parcel of this business. But the focal point has to be on learnings and how to deal with those failures better and minimising their impact,” said Venkatesh.
Key lessons and learnings
According to him, a considerable amount of time after an outage or failure is spent dwelling on the negative aspects and the downtime, rather than on finding the root cause. “I feel that after any outage, discussions should be more on better ways of coping with and minimising it. When outages occur, we need to learn from them and figure out ways of reducing their impact by applying design for failure concepts so that they can deal with failure in a better way. When someone is providing a service, and if that goes down, it should not impact anything else outside the service value add that is being provided,” explained Venkatesh.
“Of course, outages are a part of the business. But when the downtime happens, one has to also have capabilities in place to restrict its impact only to the services being provided by the component that went down and nothing more, and that’s our underlying philosophy at Indusface,” he said.
“Cloudflare service is essentially a reverse proxy that acts as an intermediary for all the internet traffic. When this entire stack goes down, if Cloudflare had mechanisms in place to have traffic automatically routed towards the application, it would not have brought down the sites. This was not done and took down all the business sites along with their outage,” added Venkatesh.
So, what’s the role of Indusface in this scenario? Here, Venkatesh added, “We also are a reverse proxy providing security and acceleration services to many critical businesses similar to the ones mentioned in this news. A significant amount of our engineering muscle and innovation was put in from Day 1 for ensuring that if we go down we do so without the customers being impacted. There is monitoring in place every few seconds to check that the site is available. If it is not available and it is due to the customer site being down, we inform the customers to fix it. But if it is not available because we are down, then our auto-bypass feature kicks in and isolates the impact of the outage by keeping it restricted to only our services not being available at that time. This is one of the most engineered features that we would never want to use, as its usage means our failure. But, paradoxically, we also wear it as a badge of honour as a saviour as when we go down we do not take our customers’ sites down with us.”
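The decision logic Venkatesh describes can be illustrated with a minimal Python sketch. This is not Indusface's implementation: the names `Action`, `is_reachable`, `decide` and `check_once` are hypothetical, as is the assumption that the monitor can probe the site both through the proxy and directly at the customer's origin. How the bypass itself is carried out is not covered in the conversation and is left out here.

```python
import enum
import urllib.error
import urllib.request


class Action(enum.Enum):
    NONE = "none"                        # site is reachable, nothing to do
    NOTIFY_CUSTOMER = "notify_customer"  # customer's own site is down
    BYPASS_PROXY = "bypass_proxy"        # the proxy is down, the site is not


def is_reachable(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with a non-5xx HTTP status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx; a 4xx still means the server is up.
        return err.code < 500
    except (urllib.error.URLError, OSError):
        return False


def decide(via_proxy_ok: bool, origin_ok: bool) -> Action:
    """Map the two probe results to the action described in the interview."""
    if via_proxy_ok:
        return Action.NONE
    if not origin_ok:
        return Action.NOTIFY_CUSTOMER
    # The origin answers directly but not through the proxy: the proxy is at fault.
    return Action.BYPASS_PROXY


def check_once(proxied_url: str, origin_url: str, probe=is_reachable) -> Action:
    """Run one monitoring cycle; `probe` is injectable so it can be faked in tests."""
    if probe(proxied_url):
        return Action.NONE
    return decide(False, probe(origin_url))
```

In practice a scheduler would call `check_once` every few seconds and act on the result. Real systems usually also require several consecutive failures before switching, so that a single dropped probe does not trigger a bypass.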
He also admits that Indusface has had failures of its own, and that every use of the auto-bypass feature feels like one time too many and should not happen again. “As engineers, by default, the thinking is to build for capability and scale, and we have to force the discussion around the consequences of downtime. So, this outage should act as a trigger for discussion around that when the downtime happens to discuss upfront what would be the areas of impact and ways of minimising it. We feel proud that we successfully have those aspects of service isolation of outages covered in our offering.”
Closing thoughts
Venkatesh would like to encourage the customers of every vendor, including Indusface’s existing customers, to keep pushing their vendors hard, not just on capabilities and differentiation, but also on the worst-case scenario and its business impact, as designing for failure has to be the underlying thought process for mitigating and managing failures better.
“It’s not just software vendors, engineers, or technology vendors, to put that engineering muscle into place, but also the customer buying it should have those discussions in place, on what would happen if things go down, what are the checks and balances, while minimising the surface area of outages are in place. Hence the world will be a better place, and headlines like these would reduce in numbers. Yes, it’s boring, but boring is good in this business,” he added.