Even the mighty shall sometimes cloudfail


From the annals of Gigaom:

http://gigaom.com/2009/06/10/amazons-ec2-service-suffers-outage/

Now, if you are not a social gaming startup, but are a supply chain or POS network hosted on AWS, you can do the calculus on whether AWS uptime (excellent by any measure) is better than a solid in-house solution for mission-critical infrastructure. Maybe for some it computes; for others, maybe not.

But when the cloud fails, your alternatives have to be in place. For example: POS systems might have a set of distributed machines to capture inbound records and route card transactions. Rapid replenishment systems might capture transaction logs for instant replication once your cloud host comes back. You might have a set of managed APIs that broker to another cloud and then reconcile on resync.
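The transaction-log alternative can be sketched roughly as follows. This is a minimal store-and-forward buffer, not a production design: the class name `StoreAndForwardLog` and the `send` callable (assumed to raise `ConnectionError` while the cloud host is unreachable) are hypothetical names introduced only for illustration.

```python
import json
from collections import deque


class StoreAndForwardLog:
    """Buffers transactions locally and replays them once the host is back."""

    def __init__(self, send):
        # `send` is a hypothetical callable that delivers one serialized
        # record to the cloud host and raises ConnectionError when it is down.
        self._send = send
        self._pending = deque()

    def record(self, txn):
        """Capture the transaction locally first, then try to drain the backlog."""
        self._pending.append(json.dumps(txn, sort_keys=True))
        return self.flush()

    def flush(self):
        """Replay buffered records in order; return how many were delivered."""
        delivered = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                break  # host still down; keep the record for the next attempt
            self._pending.popleft()
            delivered += 1
        return delivered

    @property
    def backlog(self):
        """Number of records still waiting to be replayed."""
        return len(self._pending)
```

A record only leaves the local buffer after `send` returns, so an outage in the middle of a replay loses nothing. A real system would also persist the buffer to disk and deduplicate on the receiving side.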

Many paths. However, there are some businesses that can tolerate the outages that are sure to occur as more move to remote services. One thing is for sure: the single point of failure is not just the cloud infrastructure and platform providers. The land rush to get the mid-market onto PAAS solutions has been somewhat willfully blind to the following fact – most small and medium businesses have only one high-speed connection, and most have not thought through the issues of hot comms failover at multiple sites.

PAAS that Gas, boys. One of the best things about hosted services in the cloud has hardly been spoken about – it’s great to have all remote offices and facilities routed to a central gateway, rather than running a mishmash of multi-point routers with arcane rules. The downside is comms. Even most SMBs in the $2M to $25M gross revenue range have been struggling with this. It is what has made the Cisco certifications a viable IT job and created a freelance market.


The Strategist: Mitigating Cloud Computing Client Services Risk via Trusted, Blind API Brokers – Part IV


The plain truth: There will be no cloud computing industry initiative where competitors will agree to ‘blind pool’ and backstop each other’s failures and outages. There may eventually be great open standards to move VMs and apps off of your cloud, but no set of commercial continuity services will ever hearken back to the days of centralized SNA shops with real plans, proven market capitalization, and legitimate durability. Rather, I would expect the next few years to look like this:

  • When a really big utility cloud (AWS / Google) goes down – you will get an apology and a paltry credit. The only consolation will be that it won’t happen often, and you will not be able to exceed their uptime by running your own servers. The house usually wins.
  • When a moderately well financed cloud provider starts to fumble, there will be ample warning, because folks will be watching and pinging. Have a plan, and get continuity coverage now, or as soon as it becomes specifically available for cloud users. Don’t say I didn’t warn you.
  • When a small, under-financed but buzzed-up PAAS, SAAS, Cloud, whatever… fails overnight, taking your operations with it – comfort yourself by thinking how much you saved while it was working those 5 months in 2009. Really, now, consider having at least a local or S3 proxy dup your data. Get insurance. Think before you trust business operations to a startup.
  • If I was Andy Rooney from 60 Minutes: “Have you ever noticed that all of these hosting providers have a page on how great their hosting facilities are? Even the cut-rate ones say, ‘We are a level 1000 bunker with a year of diesel backup power and armed guards, and multiple super network links.’ I mean, did they all copy the same page to make us feel safe?”

There is no perfect state of reliability; even in the days of highly centralized data shops, continuity was planned. We are transitioning from this mercantile, Web Hosting mentality to one of running business-essential applications remotely. These services are splitting into ownership categories of incumbent giants, and startups that have a semi-permanent ‘?’ on their forehead until they achieve operational liquidity. The former apologizes and credits your account; the latter disappears into the night.

The vast majority of SAAS productivity app providers use existing utility compute services, in whole or in part. It’s a cost thing – perfect continuity in the cloud, on a per-client, per-use basis, would be infinitely costly. There will, however, be no comprehensive industry clearing house offsetting failures – not in the sense of a services-for-profit model.

There are some shared risk examples where pooled trusts for infrastructure failures exist (to the best of my knowledge, these were at least proposed in underwriting requirements):

1) Telecom, national data haulers, carriers’ carriers, and submarine cable system operators sometimes negotiate emergency settlement and peering agreements as a prerequisite to satisfying underwriting requirements. Sometimes these agreements predate the insurer’s audit, and are just good business. Don’t confuse these contingency plans with standard settlements – they are negotiated for extraordinary outages and lock in fees and technical requirements. Only the very large carriers can enter into these agreements with true peers.

2) Municipal and State Gov. Emergency Radio communications networks, SMR, and certain common carriers (terrestrial radio specialty comms) sometimes have emergency coverage agreements that are mandated by statute.

3) Interstate natural gas and petroleum pipelines, etc.

The real message here is that underneath a pool of policies is a risk pricing model that is often further underwritten by a re insurer; risk pools come together faster if there are means to offset the preponderance of risk. Flood Zones are hard to mitigate, and pools are still formed, sometimes under the stentorian bark of a state regulator. But in the case of our beloved IT clouds, we have yet to get to a place where risk to an individual business that depends on a cloud service can be priced, mitigated against, and potential technical failures limited, in their worst instances.

You may now go read up on all the happy hoooha about, “the open cloud manifesto, cloud interoperability, etc.”, good luck with all that – I’m an optimist too.

We are talking here about commercially brokered services that are paid for by a pool of insurance companies, and that are funded by premiums. We don’t get there until the primary service providers are certified, rated, and as operationally good as they can be. At that crucial juncture, where a critical mass of SAAS and Cloud hosts agree to these ratings and certs, we can price the baseline risk of outages via standard actuarial methods. Subsequently, risk offsets that are purely technical in nature can be tested and put into production. Finally, when technical services are proven to be feasible, then we can look to the reinsurance market, and voilà, we have a business.
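As a rough illustration of what pricing the baseline risk could look like, here is a minimal sketch assuming the textbook pure-premium approach: expected outage frequency times expected loss per outage, grossed up by a loading for expenses and margin. The function names and every figure are hypothetical and are not taken from the research described in this series.

```python
def pure_premium(outages_per_year, avg_loss_per_outage):
    """Expected annual loss: claim frequency times claim severity."""
    return outages_per_year * avg_loss_per_outage


def gross_premium(pure, loading):
    """Gross up the pure premium so that `loading`, a fraction of the gross
    premium, covers expenses, the technical services pool, and margin."""
    if not 0 <= loading < 1:
        raise ValueError("loading must be in [0, 1)")
    return pure / (1.0 - loading)


# Hypothetical inputs for one small business, for illustration only.
expected = pure_premium(outages_per_year=2, avg_loss_per_outage=5_000)
annual = gross_premium(expected, loading=0.2)
```

Technical offsets such as log mirroring would enter a model like this by lowering the average loss per outage, which is what makes them worth funding out of premiums.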

Question #1): How many service providers and Insurers have to get on board, at least provisionally, to make a real retail or B2B market that multi-line agents and specialty carriers can sell into?

Answer #1): My research was cut short before I got that far. I felt that my client knew the answer and was testing to see if I came up with a verifying figure. My best guess is that at least 35% of the top 1000 SAAS and Cloud vendors and at least three major underwriters would be required to make a realistic market for policies and payouts that make any sense whatsoever.

Question #2): Other than the actual insurance underwriting and policy sales, is there a real business model here in operating the technical services pool of a blind trust API broker / data mirroring / continuity service for the insurance industry? How big?

Answer #2): Oh yes, oh my G-d yes. I am writing this series because I got far enough in my work for the last client that I did see the foggy future in a way that mature analysts sometimes do.

How big? I believe that operating the Trusted Services Pool will be worth about 60 – 120 million annually when it hits its stride. There may be ancillary channels and opportunities along the way that could lift revenues to 250M. So, it’s not going to be a Cisco or an HP, but a specialty business funded by the small insurance premiums paid by small and medium businesses that make cloud computing or SAAS a critical part of their operations. (Much of my work product was projecting these numbers.)

As a matter of fact, the industry as a whole may become hamstrung if these risk offsetting services are not brought online.

Maybe someone will read this very long and not too interesting series of articles (would you not rather be reading some romance novel?), and put me back to work researching and creating the product road map, lassoing potential insurance industry partners, and making this a reality (all that work!).

The services offered to offset cloud computing risk are a modest challenge to provision, and are really just another cloud service with special sauces for monitoring, security, and trust. That’s it.

You were expecting nuclear fusion? The goal of these pooled services is to cap the worst losses facing the majority of small and medium businesses that may encounter inoperative remote services – thus mitigating the top tier of policy payouts. The insurers pay for and pool these services with the premiums collected from the insured businesses.

Blind Trusted Services:

The Trusted Services Pool has to have all the attributes of trust to be established. Fiduciaries and controllers, technical management, and operations staff have to be checked out. The capitalization has to be audited, and its own operational contingencies have to be assured. Do you see what is happening here? The insurer’s technical services pool has to be as good or better than the hosting providers that it is backstopping.

Technical Services:

The goal is to offset the worst risk cases for data loss and continuity losses to operations. This does not mean an uptime guarantee. A certain major percentage of the insured population’s data and transactions has to be preserved for a reasonable premium. To support a menu of insurance coverage levels, the following technical services will probably have to be supported over time (I am avoiding an exhaustive technical discussion; who has time?):

  1. Transaction Log Mirroring and Replication. The most basic, non-data-heavy service for small business is to maintain transaction logs. These logs can be shipped and replayed to reconstruct business transactions if a cloud provider goes down or out. Especially for POS and countertop retail businesses that are making the move from a distributed server-based system, half the battle is capturing the transactions.
  2. Data Storage Proxy. In addition to table-based transactions, businesses that store document images or objects may require a backup proxy to alternative cloud storage. No big hurdle here, other than the assurance and credibility.
  3. VM machine image ready standby. If and when (some say now) a set of elastic services can be frozen and placed on near-line standby, this is a service that was discussed in my research. In meetings with several VM vendors, including some heavy hitters from IBM’s superserver division, it became apparent that many instances of ready standby could be held in stasis, and re-synchronized to transaction logs in fairly short order, especially if we are catering to small and medium businesses, and not, say, City Bank. I guess this is where the open cloud initiatives are going. It seems that many of the VM vendors are leading the way. For the purpose of the trusted pool, I felt after a period of study that this is possible and actually in practical use in limited cases.
  4. API call brokerage for live services uptake. There are already existing services that broker web APIs. These services provide scaling, monitoring, billing, etc. Trusted services for the insured pool would maintain a similar brokered pool of APIs that would either pass through the first level of calls directly to the provider, or would be cut in as an alternate route if a timeout exceeds a predetermined limit. There are a few issues here that need massaging, as it is not the business of the trusted services pool to provide transaction-level assurance when your cloud or SAAS provider times out for a few minutes. Rather, a trusted API brokerage really makes the preceding items more elegant to provision. Even competitors can backstop each other’s outages if the Trusted Services are blind to the parties and payments are settled by the trusted pool.
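The pass-through and cut-over rule in item 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of any existing brokerage product: `BlindApiBroker`, `primary`, and `alternate` are hypothetical names, and each provider is assumed to be a callable taking `(payload, timeout_s)` that raises `TimeoutError` when it does not answer within the limit.

```python
class BlindApiBroker:
    """Passes calls through to the primary provider; cuts over on timeout."""

    def __init__(self, primary, alternate, timeout_s):
        # `primary` and `alternate` are hypothetical callables of the form
        # fn(payload: bytes, timeout_s: float) -> bytes that raise
        # TimeoutError when the provider does not answer in time.
        self._primary = primary
        self._alternate = alternate
        self._timeout_s = timeout_s
        self.cutovers = 0  # number of calls routed to the alternate

    def call(self, payload):
        """Return (route, response). The payload is opaque bytes; the broker
        forwards it without inspecting it, which is the 'blind' part."""
        try:
            return "primary", self._primary(payload, self._timeout_s)
        except TimeoutError:
            self.cutovers += 1
            return "alternate", self._alternate(payload, self._timeout_s)
```

The `cutovers` counter stands in for the metering a settlement process would need. This sketch retries on the alternate after a single timeout; a real broker would presumably want a threshold so that a brief hiccup does not trigger a cut-over.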

The upsells beyond trusted services might cover all of the value-added items provided in the course of selling business continuity services, such as records management, facilities, and telecommunications. These would add revenue lines, and complement the agencies’ commission incentives.

The technical services discussion could be covered in much more depth, and I may take that on after I clear my desk. However, I wanted to close this series and show that some folks, including my former insurance industry client, are seriously looking at the business of providing indemnification services and underwriting to cloud computing clients.


Rating and Certifying the Cloud Hosting and Web Application Providers. Part III


I have been slowly morphing my consulting practice. I usually offered myself as a product sector strategy asset. Product Managers and VPs in the on-line applications business hired me to shoulder some of their burden when targeting specialist sectors – you know, industrial, technical, services, professional. These established clients usually had an idea of where their development efforts were heading. I came in to refine and prove the potential numbers. I developed approaches to paid subscriptions, industry specialty requirements, and I found innovative ways to exploit trade-specific marketing. I was the product manager’s helper, and it was a good gig until about 2007, when the economy got soft. Analysts are the first to have their contracts cut.

Now I am delivering what I learned as an analyst, and applying this to evangelizing small and medium businesses. These folks are the end users I had quantified, targeted, and interviewed in my work for web applications providers. Small and medium bizfolks perceive the benefits of hosted services and cloud computing. They clearly perceive the benefits of fault tolerance, licensing advantages, and a simplified communications topology. These smaller accounts are certainly numerous. Can they abide having recurring computing fees forever? They certainly know that their internal server and workstation / mobile infrastructure (as traditionally delivered), costs them big time when things go bad.

The SME / SMB, in other words, gets it. They get the benefits of Web-based, cloud-hosted stuff. They like getting out from under the local IT support guy, or the internal IT guy that they are held hostage to. They look forward to a time when individual routers with special configurations are replaced by safe, centralized, fault-tolerant networks, servers, and comm infrastructure that they can provision and pay for in a rational way. They just don’t know if they can trust you and if you will be around long enough to justify the cutover.

So, before I close this series, which might include one more post on the brokering of technical services between partners and competitors to backstop business continuity failures, I will talk briefly about ratings and certifications for any remote provider of compute and storage – out there in the cloud.

Established utility computing providers, like AWS, are probably uninsurable as far as clients’ needs are concerned; they are too big, and any coverage they do have insures only their own facilities and operations, which does accrue somewhat to the client’s benefit in the very long run, but does nothing when the downtime occurs. In the case of the big dogs, your insurance is their size and their need to maintain a reputation. Eventually we will get our way, and instances of client computing services will get risk-based pricing, preceded by business viability ratings, and of course, certifications for good facilities, operating procedures, and back office accounting standards. I’m willing to bet the ISO is working up something in their wild and crazy working groups as we speak.

One more thing: Why is PAAS different?

Briefly: clients using unitary applications or suites have invested a certain amount of time moving from, say, thick-client project management to a hosted solution. They have probably identified ways of moving the data off the platform (I hope), and so on. They are using an application, and we have all changed applications. PAAS is like marrying your company to .Net or some other standard. There is an investment, a rather large one for the SME, actually. For the lone developer making web apps, it’s ok.

The PAAS landscape is made of some very innovative and funny systems. I think you know what I mean. Some remind me of 4GL, some will let you host a language and framework, but not the integral database, some have language environments that are made from whole cloth. As a group they are fascinating and right on the cutting edge, and they are, as a group, undercapitalized and illiquid. There are exceptions, but I will bet you the best dinner in Boston that one would be hard pressed to find a PAAS provider that would allow an industry ratings organization to inspect their capital and operations profile.

If a SAAS application company is illiquid in its essence, then we find another and move the data. If a PAAS company is undercapitalized, we have a larger set of problems. The way migration has been handled for PAAS failures has been shameful.

Someone once asked me if a 25M round for an on-line storage provider places them in a well capitalized position; my answer was, “It depends, but generally, no, it is not considered well capitalized for the intended target and use case – 25M in a VC round ain’t shit when rating a crucial service provider that has not attained sustained profitability and near perfect uptime.”

The Strategist: Underwriting Business Continuity in the Cloud. Part II.

This is the second article, which rounds out the issues covered in the previous post.

If you want to know why these issues of ratings and insuring continuity are important, I direct the reader to this article about on-line file hosting site Carbonite.

So, as I stated in the first post of this series, I was booked by what looked like a large, well financed client; well, as my clients went (with the exception of France Telecom) they were large-ish. These folks were a 100+ year old regional insurance company that specialized in professional lines. What’s that, you ask? Professional and specialty underwriters serve, well, professions, verticals, and businesses. They usually are not auto, home, or life insurers, but they are often resold by multiline carriers. Why should you know this? Huh!

Professional lines insure business operations risks, with certain carriers targeting coverage by profession; their expertise and actuarial models require specialization in order to correctly price the risk of business interruption, and to price the premiums and payouts that indemnify the customers of professional and industrial services operations. One simple example: field service coverage, in which the technical organization is covered against customer claims of damages, losses, and liabilities that occur in the course of repairing equipment. The other side is, of course, simple coverage for interruption of operations. Some engineering disciplines (civil, structural, design, architectural, aviation, you get the idea) can buy coverage for E&O (errors and omissions).

Ya Ya, what does this have to do with hosted services and SAAS PAAS Cloud? Answer: Insuring business continuity was a game of physical premises insurance, which evolved into records and facilities, and now, today, optionally covers servers, workstations, software, and systems. It is a mishmash of offerings, and many industries have varying degrees of dependency on internal IT infrastructure. The insurance products for small and medium businesses are semi-flexible, while mega enterprises have core needs that exceed what professional lines can provide, and instead rely on customized underwriting for the Fortune 1000.