Push Comes to Shove


In 2008, I wrote a post, Why We Compete with Google, in response to the persistent question at the time: "How can you guys survive Google in the online office space?"

My basic argument was that business software, due to its extensive sales and support requirements, simply does not have the productivity or profitability of consumer internet businesses, and could never produce the margins that Google enjoys in search or that Facebook enjoys in social networking today. Consider, for that matter, how today's margins at over-the-hill Yahoo and getting-there eBay compare to those of the still-hot Salesforce. I concluded that post (keep in mind it was written in 2008) with:

When push comes to shove – and there is a lot of very messy push and shove in the business software market –  Google’s resources are going to flow into figuring out how to monetize the humongous traffic of YouTube or compete in online auctions, rather than figure out a way to squeeze a bit more margin compared to Oracle or Adobe or Salesforce. That may explain why Google has been silent on CRM, Project Management, Invoicing or HR type of tools, because those markets don’t offer the profit potential they already enjoy.

Well, today the Wall Street Journal has a well-researched post by Clint Boulton, Google Organizational Changes Cloud the Future of Apps. On purely strategic grounds, the enterprise business is very sales- and support-intensive, and it does not have the potential to offer Google-y margins. Microsoft achieved its 90% operating margin through monopoly pricing power, and the emergence of the cloud and mobile devices, in part thanks to Google's role, has eroded that pricing power. So it was always clear, to me at least, that Google was in this game to make sure Microsoft did not have infinite cash to keep throwing at search. As Google's strategic threat from Microsoft fades into the rearview mirror and Facebook emerges as a strong potential threat, I predicted that Google would naturally lose interest in what remains a fundamentally inferior business compared to its core. Senior executives at Google in charge of the Apps business seem to be reading the tea leaves, as Boulton reports: there has been a slew of executive departures and other organizational changes in the Google Apps division.

Signs are emerging that Google is de-emphasizing its efforts in online productivity tools that compete with Microsoft, which was never the core of its business to begin with, to focus even more on search and social networking, and its increasing competition with Facebook.

Google Apps has had some churn to its core leadership as the company evolves under CEO Larry Page, including the loss of Dave Girouard as vice president of Apps and president of Google’s Enterprise business. Girouard, who joined Google in 2004, oversaw the development and launch of Apps for businesses. He left April 6 and no successor has been named.

A source familiar with Google Apps told CIO Journal: “I was personally shocked to see Dave G leave. That was his baby, and he was so invested in it.”

At Zoho, of course, we have patiently been investing in R&D while building our business for the long haul. We came to the conclusion a while ago that the ad-driven consumer internet business model is a poor fit for a business-focused suite of apps. In an ad-driven business, the users are the product, to be packaged and sold to advertisers. When you ask someone to pay directly for something, as a visitor or user becomes a customer, the very nature of the engagement changes.

That’s why since the very beginning we have never funded our business through advertising. We’ve made a commitment to our users that we’ll never display ads, not even in our free products, and that we’ll never sell their data to a third party so that they can be “better targeted”. We’ve understood since the beginning that advertising and business applications just don’t mix.

We will continue to provide a compelling, ad-free cloud experience in our Zoho suite of apps, particularly Zoho Mail and the Zoho Office suite. We respect the engineering prowess at Google, and indeed, we will continue to actively participate in the Google Apps marketplace, but ultimately whether a business makes sense or not is not an engineering question alone. In that sense, I am not at all surprised that someone high up at Google looked at the business case for the Apps suite and came to the conclusion that was obvious from the start.

Our deep, existing integration with the Google Apps suite makes it really easy for customers to migrate from Google Apps. We welcome Google Apps users to Zoho, and we are very happy to provide migration free of charge!

Now, what about the other elephant in the room? Microsoft Office 365 is going to be a formidable player, but so far their execution in the cloud and in mobile is less than terror-inducing, to put it mildly. Our Mail & Office suites were written from the ground up for the cloud, tablets, and smartphones. We will continue to invest in R&D to make them stronger, and, unlike Microsoft, we will provide first-class support for all the devices out there, including the iPad, the iPhone and Android-based devices. Of course, we will also continue to integrate our Mail & Office suite with our other business apps, including our rapidly growing CRM.

Zoho CRM Read-Only Mode Available From Our Secondary Data Center


One of the priorities we set for ourselves after our outage was to get a read-only instance up quickly in our secondary data center in the New York metro region. We have always had the data backed up in that data center, and our software team has been busy making it run in read-only mode. Today, we are happy to announce that Zoho CRM is running in read-only mode from our secondary data center.

The URL is https://crm-ro.zoho.com

We will keep this read-only mode accessible to customers at all times, so that in the event of an outage in our primary data center, your access to data will be preserved. We still have some limitations we are working through – for example, database information is synchronized immediately between our Silicon Valley and New York data centers, while document attachments have a one-hour lag, which we will be addressing soon. There are some other limitations: search does not work, and most of the integrations, particularly Mail integration, do not work. All of these limitations will be removed in a few weeks. Our goal in getting this out quickly is to give you access to your core CRM data from our secondary data center.
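To illustrate what "read-only mode" means in practice, here is a simplified sketch (not our actual code – the class and method names are purely illustrative): reads are served from the replica, while any request that would mutate data is rejected until the primary is back.

```python
# Illustrative sketch of a read-only gate in front of a service.
# All names here are hypothetical, not Zoho's actual implementation.

READ_METHODS = {"GET", "HEAD", "OPTIONS"}

class ReadOnlyGate:
    """Serve reads from a replica; reject mutating requests while read-only."""

    def __init__(self, read_only: bool = True):
        self.read_only = read_only

    def handle(self, method: str, path: str) -> tuple[int, str]:
        if self.read_only and method.upper() not in READ_METHODS:
            # 503 signals that the write path is temporarily unavailable
            return 503, "Service is in read-only mode"
        return 200, f"{method} {path} served"
```

For example, `ReadOnlyGate().handle("POST", "/contacts")` would be rejected with a 503, while `handle("GET", "/contacts")` is served normally; flipping `read_only` to `False` restores the write path.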

Our teams are working on read-only versions of other Zoho services right now; we will open them up as soon as they are ready and post announcements accordingly.

This read-only version is just a start. We are working to provision our New York data center fully, so that all Zoho services can run in hot standby mode and we can switch over quickly when the next disaster strikes.

Our Friday Outage and Actions We Are Taking

On Friday, January 20th, we experienced a widespread outage that affected all Zoho services. The outage started around 8:13 am Pacific Time. Zoho services started coming back online for customer use at 3:49 pm, and all services were fully restored at 6:22 pm PST. We absolutely realize how important our services are for businesses and users who rely on us; we let you down on Friday. Please accept our humblest apologies. 

The cause of the outage was an abrupt power failure in our state-of-the-art colocated data center facility (owned and operated by Equinix) in the Silicon Valley area of California. Equinix provides us physically secure space and highly redundant power and cooling; we get our internet connectivity from separate service providers, and we own, maintain and operate the servers, the network equipment and the software. The problem was not just that the power failure happened; the problem was that it happened abruptly, with no warning whatsoever, and all our equipment went down at once. Data centers, certainly this one, have triple, and even quadruple, redundancy in their power systems precisely to prevent such an abrupt power outage. The intent is that any power failure comes with sufficient warning, so that equipment, databases most importantly, can be shut down gracefully. In fact, the main function such data centers perform is to provide extreme redundancy in power systems, cooling for the equipment, and physical security. There was absolutely no warning prior to this incident, which is what we have asked our vendor to explain, and we hope they will be transparent with us. I do want to say that Equinix has served us well; they are a leader in this field, and we have never suffered an abrupt power outage like this in 5+ years. But they do owe us, and other customers in that data center, an explanation for what happened on Friday. They restored power quickly, but the damage was done because of the abruptness of the outage.

As of today, while we have a substantial level of redundancy in the system, we still rely on our data center provider to prevent an abrupt power outage (it happened once, so it could happen again), and we are scrambling to prevent another power outage from becoming a service outage of Friday's duration. Those are literally the first steps we are taking. This includes having our own separate UPS systems (in addition to all the UPS systems, generators and cleaned-up utility power that our vendor provides in the data center), and database servers that have batteries in them, so they can be shut down gracefully in an event like this.
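The battery-backed graceful-shutdown idea can be sketched like this (a simplified illustration only – the database interface and function names here are hypothetical, not our actual software): once the server is running on battery, flush what you can before the battery's deadline, then sync and stop cleanly so no crash recovery is needed on restart.

```python
# Illustrative sketch: graceful database shutdown within a battery budget.
# The `db` interface (dirty_tables/flush/sync/stop) is hypothetical.

import time

def graceful_shutdown(db, battery_seconds_left: float) -> None:
    """Shut the database down cleanly while battery power remains."""
    deadline = time.monotonic() + battery_seconds_left
    for table in db.dirty_tables():
        if time.monotonic() >= deadline:
            break                # out of time: sync what we have and stop
        db.flush(table)          # write dirty pages for this table to disk
    db.sync()                    # fsync so the on-disk state is consistent
    db.stop()                    # clean stop: no crash recovery on restart
```

The point of the sketch is the ordering: flushing is best-effort within the battery budget, but sync and a clean stop always happen, which is exactly what an abrupt power cut denies you.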

Now let me acknowledge that it took us way too long to recover, and let me explain first why it took so long, and then explain what we are going to do about it in the future. In a nutshell, when every database cluster and every server went down, the sheer amount of testing and recovery work overwhelmed our human-in-the-loop recovery system. There was never any issue with the safety of the data itself. 

We have a massively distributed system, and the design intent of such a system is that everything does not fail at once: parts of the system can and do fail without impacting overall service availability. The problem was that when the entire system went down, it required manual recovery. We had about 20 people working to restore services, but there are well over 100 clusters, of which about 40% had errors – basically, the redundant database servers within a cluster were out of sync with respect to each other. The inconsistency across replicated instances is usually very slight – perhaps a few bytes off in a 100 GB instance – but the only thing that matters is that there is inconsistency, however slight. This is recoverable without any data loss (except for the data entered at the exact moment the power went down), and the recovery process is necessary to ensure that there is no data corruption and that all data is consistent across the replicated instances. In most instances this was fast, but in some instances recovery took time, and the number of such slow-to-recover instances delayed the overall recovery. In fact, the first few clusters we tested were OK, and we relied on that to provide an estimate of recovery time that proved too optimistic once later instances turned out to have problems. There were simply too many such clusters for the 20 people to recover in parallel. In effect, the human system was overwhelmed by the scale of the problem. That is why it took us so long to bring all services back up.

We do have all data mirrored in a data center in the New York region (also owned and operated by Equinix), and that data center was not affected by the power outage. All the data was present in that secondary data center, so there was never any possibility of data loss, even if all our primary servers had been wiped out completely. But as of today we do not have sufficient capacity to run all Zoho services from that secondary data center. We have 3 copies of your data in the primary data center, and usually 1, sometimes 2, copies in the secondary. That means the secondary data center currently has neither a) sufficient data redundancy by itself to run all the services (i.e., assuming the primary data center is totally dead), nor b) sufficient computing capacity to process all the traffic by itself. Our secondary data center serves to protect customer data, but it could not serve all the traffic. We intend to address this ASAP, starting with some of our services first.

Our first focus is on preventing an outage like this from happening again; the second is faster recovery when disaster strikes. We have been working on this second problem for a while already, and we will accelerate it. Additional steps we are taking include: a) offering better offline data access, so customers never have to go without their mission-critical business information; b) offering read-only access to data on the web quickly, so at least access is preserved while we work to recover the editable instance; and c) adding more automation, so recovery from a large-scale incident can happen with less manual intervention.

During this entire episode, our first priority was to make sure customer data remained safe. No customer data was lost, but because incoming mail server queues overflowed (the mail store went down), some mail bounced back. We are working on preventing such a thing from happening again, with a separate mail store instance.

We will keep you steadily updated on the progress we are making on each of these priorities. Hardware progress is going to be the fastest (buy and install new systems), and software is going to be the slowest (implementing better automation for faster recovery is going to take time), but we will keep you posted on all the progress we make. That is our promise.

This was, by far, the biggest outage Zoho has ever faced. We absolutely understand that many businesses rely on Zoho to go about their work on a daily basis. We can understand how many customers were disappointed and frustrated by this outage. We too, are extremely upset about this incident. 

In the coming days we will be refunding a week's worth of your subscription to each and every customer, whether you complained or not. We know money will not give you back the time you lost or compensate you for the hassle and trouble, but we hope you will accept it with our deepest apologies. While the refund may not mean much to any single customer, at an aggregate level it does affect us, and that penalty will serve as a reminder to ourselves not to let this happen again. That is ultimately the best assurance I can give.


Root Cause Analysis of our December 14 Outage


We had a 3.5-hour outage of Zoho Mail, Zoho Support and Zoho Books, between 8:45 am and 12:15 pm PST on December 14. First of all, I want to apologize to our customers and users. We let you down and we are sorry. We know how important it is to have access to the vital business information you entrust to Zoho; our entire company runs on Zoho applications, so we understand this in an even more intimate way. With a view to preventing such incidents and improving our response when they do happen, we reviewed the root cause of this outage and our response to it. This post provides a summary.

The outage arose from a simple configuration gap between our software applications and the network. One part of the health-check mechanism built into our software (ironically, the very part that is designed to prevent an outage impacting customers) made an unneeded reverse DNS request to resolve an IP address to a name. The network stack did not adequately provide for this reverse DNS look-up; it worked until it stopped working. In effect, we had a single point of failure that our network ops team was unaware of, because the software had this implicit dependency.

When the reverse DNS lookup failed, the health-check mechanism (incorrectly) concluded that the software applications were failing, and proceeded to restart the "failing" application servers one by one, as it is programmed to do. Of course, even after the application servers restarted, the health check would still fail, because the failure was not in the software itself.
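The failure mode can be reconstructed in a few lines (a simplified sketch, not our actual health-check code – the function names are illustrative): the health check quietly depends on reverse DNS, so a DNS failure is misread as an application failure, and the supervisor restarts a perfectly healthy application to no effect.

```python
# Illustrative reconstruction of the failure mode: a health check with a
# hidden reverse-DNS dependency. Names here are hypothetical.

def health_check(app_alive, reverse_dns) -> bool:
    """Report healthy only if the app responds AND its IP resolves to a name."""
    try:
        reverse_dns()            # hidden single point of failure
    except OSError:
        return False             # DNS failure is misread as an app failure
    return app_alive()

def supervise(app_alive, reverse_dns, restart, max_restarts: int = 3) -> int:
    """Restart the app while the check fails; restarts cannot fix DNS."""
    restarts = 0
    while not health_check(app_alive, reverse_dns) and restarts < max_restarts:
        restart()                # futile: the app was never the problem
        restarts += 1
    return restarts
```

With a live app but a dead DNS resolver, `supervise` burns through every allowed restart without ever restoring a passing check, which is exactly the restart cascade described above.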

Since the failure was happening in a disparate set of applications that share no resources (no physical servers, no file systems or databases in common) other than being part of a sub-network, the initial suspicion was focused on the switches serving that sub-network. In reality there was a shared dependency but that was not immediately identified. This wasted precious time. In the end, the reverse DNS problem was identified, and the fix itself took just a few minutes.

Here are the lessons we learned on December 14:

1. Subtle information gaps can arise between teams that work together – in this case the network ops and the software framework teams. The software had a dependency on a network component that the ops team did not appreciate, which created an unintended single point of failure. The failure mode was always there, and on December 14 it came to the surface.

Action: We will make the configuration assumptions made by every piece of software much more explicit and disseminate them internally. Our monitoring tools will also be strengthened to check the actual configuration against the assumptions made by the software.
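The shape of such a check could look like this (a simplified sketch with illustrative names, not our actual monitoring code): each service declares its network assumptions, and monitoring runs each declared assumption against reality, reporting the ones that fail.

```python
# Illustrative sketch: verify declared configuration assumptions against
# reality. The check names and structure are hypothetical.

import socket

def default_checks() -> dict:
    """Real network checks that monitoring would run periodically."""
    return {
        "forward_dns": lambda host: socket.gethostbyname(host),
        "reverse_dns": lambda ip: socket.gethostbyaddr(ip)[0],
    }

def verify(declared: list, checks: dict) -> list:
    """Run each (kind, target) assumption through its check; return failures."""
    failures = []
    for kind, target in declared:
        try:
            checks[kind](target)
        except OSError:
            failures.append((kind, target))   # assumption no longer holds
    return failures
```

Had the reverse-DNS dependency been declared this way, monitoring would have flagged the broken lookup directly, instead of it surfacing as a mysterious application "failure".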

2. After the outage, precious time went into testing various hypotheses, though the root cause, as it turned out, was quite simple. This is the most stressful period, and some of that was inevitable, which is why prevention is so vital. We had 5 people checking out various aspects of the system, but they were not aware of this software dependency. If they had known, the fix would have taken a few minutes; instead the outage lasted 3 hours.

Action: We are reviewing our incident response procedures to bring people with relevant knowledge on the spot more quickly. We will also provide more training to our operations team members, so they can diagnose and troubleshoot a broader set of problems. Our monitoring tools will also be strengthened to provide more diagnostic information.

3. This is a more fundamental, mathematical problem with any feedback loop: adaptive, fail-safe mechanisms can exhibit unforeseen or unintended behavior and ultimately cause failures themselves. Basically, a failure is declared and action is taken; if the diagnosis is wrong (often very subtly wrong), the inappropriate action feeds back into the fail-safe mechanism. We have humans in the loop for precisely this reason, but in this case there was a single point of failure the ops team on the spot was not aware of, so they could not stop the cascade.

Action: We are reviewing our fail-safe mechanisms to identify such cascades and involve the human-in-the-loop better.

To summarize, we believe the failure was preventable, and in any event the outage should have been resolved a lot sooner. Once again, please accept our apologies. We have resolved to improve our tools and internal processes, so we can do better in the future.

Sridhar Vembu

Services Are Up, Root Cause to Follow


We have restored all Zoho services. They should be working normally. Our teams are monitoring the situation closely, and if you encounter any trouble, please let us know.

We know this is not what our customers and users expect from Zoho. We let you down today. Please accept our humble apologies. We have launched a full investigation into the root cause, how we responded to it, what we could have done to avoid the problem, and how we could have resolved it sooner. We will post this report as soon as it is ready.

Update: Our preliminary information is that a reverse DNS lookup failed in one of our subnets, which caused the outage in some of the services. We are still trying to determine why it failed, why it didn't trigger other types of alerts, and why this failure resulted in such a service outage. Once the incident is fully understood, I will post a detailed report.


Some Zoho Services Down, Please Check our Twitter Feed @Zoho for Updates


As of 9:05 AM Pacific time, we started encountering difficulties in many Zoho services. The services affected are Zoho Mail, Books and Support, along with sporadic issues in other services. We have narrowed the problem down to network issues in our data center, and we are analyzing it. We will restore services as expeditiously as possible.

Please check our Twitter feed http://twitter.com/#!/zoho for updates.

We apologize for the inconvenience.
Update: As of 12 noon PST, we have restored Zoho Mail, Support and Books. We are still not completely out of the woods, and we are monitoring all services closely. Meanwhile, we are also looking into the root cause of today's outage. We will make a detailed post as soon as we determine it, and we will outline the actions we are taking to make sure this does not recur. Please accept our humble apologies.

Why I Am a Businessman and Why My Employees Are My Heroes


(picture courtesy: The Hindu online)
That scene above is from the suburb of Chennai (Tambaram) where I grew up and where my parents still live, but in reality it could be anywhere in India. We get monsoons this time of year every year, yet every year this is how it looks for a few weeks.
I am in business because I believe there is really only one solution. We need 25x more businesses, 25x more jobs, 25x more infrastructure before India could be considered a livable country. I hope to live to see the day when one city, just one fucking city, in India will offer a world-class quality of life. Today, almost no city in India comes close to offering what would be considered an acceptable quality of life. That is why I wake up every day and go to work: my dream is to create sufficient profits to directly fund the infrastructure we need to live a decent life.
That crazy idea, directly funding infrastructure out of profit, would get me thrown out of my job in a nanosecond if Zoho were a public company. That is why I don't take venture capital and won't ever take my company public. The good news is that we have decent profit, and as we grow, my plan is becoming less and less crazy.
But it is not really about me; I live the good life. It is my employees in Chennai who are the true heroes. Every single one of them goes to work under these conditions – and I am going to be there tomorrow. Scenes like these are literally everywhere in Chennai. Our people write code, support customers, and teach and learn from each other, all under these conditions. The fact that our people ship the products they do under these conditions is a testimony to the resilience of the human spirit. I grew up in exactly these circumstances, but I still think of it as nothing less than a miracle that we are able to do the work we do.
My employees are just like the people in that picture. They are hardworking, good people forced to live under a broken system. Governments in India are famous for a singular lack of vision and imagination. In that picture, for example, you would think the local municipal body would be responsible for the civic infrastructure. No, that would be too obvious and too sensible, and of course that is how any functioning system anywhere in the world would work. In India, financial responsibility for that road is divided between the state government, which manages the affairs of only 70 million people, and the central government (because there is a railway line in the picture, and railways are all run by the central government, including the local train network of Chennai). The local municipal entity is powerless and broke; it exists basically to receive petitions and forward them to the state or central governments – well, when they take a break from their wheeling and dealing to get around to doing any work at all.
We have ministries and departments for everything under the sun, from condoms to condiments, from rain forests to railways, from fisheries to fertilizers, from information technology to imaging satellites. Except that we don’t focus on basics like roads, sanitation or drinking water, because, well, that would be too obvious and too sensible. Ronald Reagan’s dictum “Government is not the solution to our problems, government is the problem” applies with astonishing force and clarity in India. That is why I am a businessman.
PS: If you want to see more pictures like this, here is a slideshow from The Hindu.