Ever thought a product or a platform could “give you scalability”? Now the availability of cloud computing is compelling IT shops to plan for new architectures. And suddenly those who thought they had scalability don’t.
If you believe that a scalable architecture for an information system, by definition, gives you more output in proportion to the resources you throw at it, then you may be thinking a cloud-based deployment could give your existing system “infinite scalability.” Companies that are trying out that theory for the first time are discovering not just that the theory is flawed, but that their systems are flawed… and now they’re calling out for help.
“Tell you what,” the development tools vendor told me on the day Microsoft officially launched Windows Azure, “This solves the whole scalability problem forever, doesn’t it? Just stick your application in the cloud. Infinite scalability!”
Scalability, we’re told, is the inherent ability of an information system to acquire more resources and continue to perform normally. Before a business invests in any bigger or better resource, such as Microsoft Exchange, SharePoint, or Windows Server, it’s sold on the premise that the resource can grow as the business gets bigger. The system will be just as affordable, efficient, and practical in five years’ time as it is today. For years, that premise seemed reasonable enough.
But fundamental presumptions implicit in that notion no longer truly apply. It’s as though four cosmic forces converged at the same moment: the availability of cloud computing, the versatility of virtualized processing, the global accessibility of the Internet, and the commoditization of processors and storage. As a result, the capacity of a company’s information systems is no longer directly proportional to its mass. For segments of the economy where information itself is the key product, the very meanings of “big” and “small” have become skewed. So now that formerly “big” businesses look over their shoulder, and see arguably “small” ones deploying what appear to be truly scalable, efficient, practical architectures on a fraction of their budgets, the need to stay competitive is finally forcing companies to stop postponing the inevitable.
For certain businesspeople, scalability has become as important and fundamental a principle as democracy, capitalism, or derivatives. A universe where businesses fail to scale up, and computing resources fail to stretch like a tube sock to meet their needs, is uncomfortable, daunting, scary. Now, “the cloud,” once the ultimate solution to scalability in engineering, confronts both system architects and software developers with a harsh and painful reality: To do business in this universe, we have to start completely over.
“With the people we’re talking to, the first step is helping them come to the conclusion that whatever they have isn’t working,” says Bradford Stephens. He’s a software engineer whose startup firm, Drawn to Scale, is working with companies to develop an entirely new data platform for this foreign universe. “They come to it the hard way. Either they’ve lost data or they’ve had to change their business model, which is surprisingly common. So once they make that realization, it’s more like, ‘Okay, how do you translate from talking about this relational world to talking about this scalable, [Google] BigTable world?’”
Stephens belongs to a new breed of system architects and developers (and, more frequently, both) who were the first to realize, in all the literal senses this implies, the scale of the problem at hand. He is also the first to admit there are no set solutions, no best practices, no templates, at least not yet, for remodeling business applications. There are simply too many unique factors. Businesses shrink, they merge, they are acquired, they shed departments, they absorb other departments, they outsource various tasks (sometimes seemingly at random), they cease to exist for certain intervals and are resurrected under new names.
“What we’ve found is, these companies who are experiencing these big data scalability pain points, of course, wish they had tackled the problem earlier,” says Stephens. “But these sorts of problems you don’t realize you have until you try to solve them.”
Cloud computing enabled workshop-sized companies — many of them fresh startups — to deploy within months, and in some cases weeks, service-oriented architectures, using lightweight and often open-source frameworks, establishing instant information services for clients on an Internet scale. No less importantly, virtualization brought forth a radical reorganization of the fundamentals of system architecture, such that the resources any company has at hand at any one time to process a job switched from a constant to a variable. Suddenly, someone’s basement business could have enough processing power to address as many customers as a multi-billion-dollar enterprise, for just the few days or hours it needed that power. But unlike the enterprise, it could drop that power when it no longer required it.
Notes Stephens: “FlightCaster, a little four-person startup, trolls through dozens of gigabytes of data a day to predict if your flight is going to be late. Not that long ago, information on that scale was only generated by really big businesses, because only they had the ability to generate it.” I know what you’re thinking: Isn’t that supposed to be a good thing? For FlightCaster, yes, but not for the enterprises like Sabre Travel — the modern culmination of American Airlines’ multi-billion-dollar Sabre network — that depend on the information services that FlightCaster’s iPhone app just outmoded. The problem with “disruptive technologies,” to borrow a phrase from Microsoft chief software architect Ray Ozzie, is that they’re so damn disruptive.
The sudden (and occasionally catastrophic) impact of disruptions such as this on businesses and the economy in which they function has swept Microsoft itself into a role it never expected to play: counseling.
“The first thing every architect needs to know is, ‘You are not alone,’” reassures Justin Graham, Microsoft’s senior technical product manager for Windows Server. “Seek help from a Microsoft partner and/or Microsoft Services to get on the right track. In the current environment, we understand budgets are tight, which is why Microsoft TechNet, MSDN, and User Group Communities are available to share information and assist. From a process perspective, make sure to not overlook the opportunity a re-architect provides to optimize the infrastructure. Thinking about it solely as, ‘How do I merge these technologies?’ [may not be as helpful as] ‘What is the best and most optimized infrastructure that can make the new organization successful?’”
For more and more businesses, the re-architect is the counterpart of the clinical psychologist. Whether hired as a consultant or full-time, he enters the situation knowing that the only way to find a path toward a solution — the only way he can “optimize the infrastructure” — is by divorcing the business from its own false perceptions and bad habits.
One of those habits is throwing new hardware at the problem: the traditional route for “scaling up.” “People need to change their mindsets from buying hardware by default, to that of a small company where they can’t afford to buy hardware, back in the day,” remarks Sean Leach. Newly installed as the CTO of domain registrar Name.com, Leach has recent experience as architect for what might be the ultimate high-scale application: UltraDNS, a real-time, high-security extension to the Internet’s Domain Name System, providing a constantly updated directory of verified DNS addresses to which businesses subscribe. As far as scaling up is concerned, Leach has been to the mountaintop.
“Cloud computing is actually making this problem a little bit worse,” states Leach, “because it is so easy just to throw hardware at the problem. But at the end of the day, you’ve still got to figure, ‘I shouldn’t have to have all this hardware when my site doesn’t get that much traffic.’”
Failure of Scale
The core of the problem — which Stephens, Graham, and Leach all address in their own ways — is that existing business applications were not designed to scale, or mutate, or metamorphose as they’re pressured to do. In an older world, a business would invest in more horsepower, buy new hardware, scale up. But that presumed that the organizational structure of the business was its constitution, and that as it grew — as all things seem to grow, linearly — the structure would simply magnify.
“Traditionally, not only do companies think about databases, but also their silos,” notes Bradford Stephens. “You’ve got your customer transaction database, your business intelligence (BI) database, your HR database, and the database that powers your Web sites. And this data is copied and replicated so that you’ve got a customer in your billing system and one in your CRM. So people think about data not as data, but like vertical units. This is my BI data, my data warehouse data, my transaction data.” Regardless of the inefficiencies and redundancies introduced when relational databases aren’t used for the job they’re designed for — relating — each department’s ownership of its own pocket of data is respected at all costs. Scalability breaks here.
In a way, it was inevitable for this way of thinking to become ingrained into companies’ operations, because of the way they manage budgets. Each department, like a rival sibling, scuffles with all the others for bigger outlays. So to demonstrate to the CFO that it deserves more, the department consumes more… more bandwidth, more gigabytes, more processors. “Years ago, if you were the only guy in an enterprise with lots of data, the ability to spend money was proportional to how much data you generated. It was linear,” says Stephens. As a result of this thinking, Microsoft, IBM, and Oracle historically were only too happy to oblige.
In just a few years, the rise of high-bandwidth media via the Internet has enabled one person, or small groups, to consume colossal amounts of data (terabytes per person), so that consumption rate no longer implies either relative size or worth.
It’s here, Stephens says, where scalability shows up where businesses want it least. When businesses simply relocate their existing information systems model to the cloud, its problems become magnified at Internet scale. “Twice as many connections generate four times as much data, and 10 times as many connections generate 100 times as much data… Little mistakes can have exponential impact. When you run up really large scales — for example, if your code isn’t efficient and you’ve got a wrong loop somewhere — not only are you increasing network traffic 200% across your little, tiny network of two machines… but you’re going to be increasing network 500% across [all these cloud] machines that you’ve rented. That’s an extremely costly mistake. So in a distributed, scalable world, you have to have metrics and you have to have cost analysis. You have to plan for that from the beginning.”
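The figures Stephens quotes (twice the connections, four times the data; 10 times the connections, 100 times the data) match growth proportional to the square of the connection count, which is steep but quadratic rather than strictly exponential. Here is a minimal sketch of that arithmetic, assuming a pure square-law model that the article itself never states as a formula:

```python
def data_growth_factor(connection_multiplier: float) -> float:
    """Relative data volume under an ASSUMED quadratic model.

    The square law is inferred from the two data points in the quote;
    a real system's growth curve has to be measured, not assumed.
    """
    return connection_multiplier ** 2


# The two multipliers mentioned in the quote.
for multiplier in (2, 10):
    factor = data_growth_factor(multiplier)
    print(f"{multiplier}x connections -> {factor:.0f}x data")
```

Under this assumption, any per-connection inefficiency gets multiplied by the same square-law factor, which is why Stephens argues for metrics and cost analysis from the start.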
So Microsoft is advising these re-architects to turn an introspective mirror toward their own businesses. “A best practice of focusing on management will serve a system re-architect very well,” advises Justin Graham. “If the architect is trying to merge two organizations, or re-architect to meet the changed priorities of the business, any problems that existed in the past will exist in the future if a straight migration approach is taken. Think about how management and an optimized process can help you re-architect the infrastructure to be agile and scalable.”
Bradford Stephens agrees: “Scalability does not imply efficiency. You may have a million boxes doing something, and not doing it particularly well. When you build architectures, the first thing you have to worry about is scalability, because you can’t back-fill it. It’s nearly impossible.
“Efficiency is incredibly important because it saves you money; and in this cloud world, in fact, it’s actually more important to be efficient because the impact of inefficient code is so much greater, and it can be measured. If it takes me five boxes to handle 20 transactions per second, and then I can make it so I can handle 40 transactions per second, that’s something you can measure and you can justify spending engineering output on. In sort of a cloud-ish or scalable world, that translates directly into saved money and saved time.”
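Stephens’ five-box example can be turned into a rough cost comparison. The sketch below keeps his numbers (five boxes, 20 versus 40 transactions per second) and adds a hypothetical target load and rental price purely for illustration; neither of those two figures comes from the article.

```python
import math

# ASSUMPTIONS for illustration only: the rental price and the target load
# are hypothetical. Only "five boxes" and "20 -> 40 transactions per
# second" come from the quote.
HOURLY_PRICE_PER_BOX = 0.50      # assumed rental price, in dollars
HOURS_PER_MONTH = 24 * 30


def boxes_needed(target_tps: float, cluster_tps: float, cluster_boxes: int = 5) -> int:
    """Boxes required to serve target_tps, given what one cluster handles."""
    per_box_tps = cluster_tps / cluster_boxes
    return math.ceil(target_tps / per_box_tps)


def monthly_cost(boxes: int) -> float:
    """Monthly rental bill for the given number of boxes."""
    return boxes * HOURLY_PRICE_PER_BOX * HOURS_PER_MONTH


target = 200  # hypothetical peak load, in transactions per second
before = boxes_needed(target, cluster_tps=20)   # original code
after = boxes_needed(target, cluster_tps=40)    # optimized code
print(f"before: {before} boxes, ${monthly_cost(before):,.2f}/month")
print(f"after:  {after} boxes, ${monthly_cost(after):,.2f}/month")
```

Doubling per-box throughput halves the number of rented boxes for the same load, and that difference is the measurable saving Stephens says can justify the engineering time.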
Sean Leach is happy with the notion of turning that mirror even closer towards oneself. “Ninety-nine percent of the time that I’ve seen performance problems that are blamed on the database, it’s actually the person who wrote the queries or [who wrote] the application on top of the database. So there’s no magic ‘fast = true’ flag you can set; every database is very similar. Some of them scale better with a lot of records… But at the end of the day, it’s the person who writes the application that will be the reason for it being slow, not the software itself.”
It’s not that scalability does not, or should not, exist. It’s that we should divorce the business’ growth pattern from that of its information systems. Rather than use tools such as Windows Workflow Foundation to model application tasks around what people do (especially if their departments may cease to exist in a few years), instead model the application around what the information systems should do. Let the cloud disrupt the way of thinking that binds users, and departments of users, to computers rather than resources. Then build front ends to model how people should use those systems. If a company’s information systems are designed well from the outset, our experts tell us, with a loosely coupled approach between processors and business methods, then the company could completely mutate to unrecognizable proportions, and yet its systems’ logic may remain sound.
However, Sean Leach does allow us some breathing room. “Sometimes you have to use hardware. But what you generally find is, the people up front don’t take the time to plan ahead. There’s two trains of thought: There’s just ‘Get it out there,’ and if you have to scale, then that’s a good problem to have, worry about it later. Ninety-nine percent of applications never get any traffic, so they don’t have to scale, right? That’s one train of thought. The other one is, you can spend six months trying to figure out, ‘How am I going to scale this properly?’ You design it from the beginning… and then you don’t actually ever launch something.”
For Leach, the trickiest part is finding a happy medium where developers can design these systems so that it’s not a complete rewrite when the software becomes popular. “Design it up front so you don’t have to throw the hardware at it so early in the game. If there comes a time where you really do need to throw the hardware at it, then fine, make sure that your system can support it. But the goal should be that you shouldn’t have to throw that extra hardware at it until you really need it.”
Data Scales By Itself
Depending on the business growth pattern, conceivably its logic may never have to scale. What will scale is its data. Thus, suggests Stephens, businesses should design applications that don’t require incremental rescaling just to account for periodic explosions in data consumption.
“Just because your application scales doesn’t mean your data does. Data is what drives businesses; data is the important part,” says Stephens. “You have to rethink everything you do with data from the bottom up… If your application is well-designed, you should only have to change your data layer. Your front end should be totally independent. But you’re going to have to go in and write queries, or make certain assumptions that you’re talking to a distributed cluster, and you’re going to re-architect your data layer — not your whole application. Re-architect your data layer for that new reality.”
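One way to picture the separation Stephens describes is a front end that talks only to an abstract data layer, so that swapping a single-node store for a distributed one leaves the front-end code untouched. The sketch below is a toy illustration under that assumption. The class and function names are hypothetical, and the “cluster” is simulated with in-memory dictionaries rather than any real distributed database.

```python
import zlib
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class CustomerStore(ABC):
    """Hypothetical data-layer interface; the front end sees only this."""

    @abstractmethod
    def put(self, customer_id: str, name: str) -> None: ...

    @abstractmethod
    def get(self, customer_id: str) -> Optional[str]: ...


class SingleNodeStore(CustomerStore):
    """Stand-in for the original, single-database data layer."""

    def __init__(self) -> None:
        self._rows: Dict[str, str] = {}

    def put(self, customer_id: str, name: str) -> None:
        self._rows[customer_id] = name

    def get(self, customer_id: str) -> Optional[str]:
        return self._rows.get(customer_id)


class ShardedStore(CustomerStore):
    """Stand-in for a distributed cluster: keys are hashed across shards."""

    def __init__(self, shard_count: int = 4) -> None:
        self._shards: List[Dict[str, str]] = [{} for _ in range(shard_count)]

    def _shard_for(self, customer_id: str) -> Dict[str, str]:
        # crc32 gives a stable hash, so a key always lands on the same shard.
        index = zlib.crc32(customer_id.encode("utf-8")) % len(self._shards)
        return self._shards[index]

    def put(self, customer_id: str, name: str) -> None:
        self._shard_for(customer_id)[customer_id] = name

    def get(self, customer_id: str) -> Optional[str]:
        return self._shard_for(customer_id).get(customer_id)


def welcome_message(store: CustomerStore, customer_id: str) -> str:
    """Front-end logic: identical no matter which data layer is plugged in."""
    name = store.get(customer_id)
    return f"Welcome back, {name}!" if name is not None else "Unknown customer"


for store in (SingleNodeStore(), ShardedStore()):
    store.put("c-1", "Ada")
    print(type(store).__name__, "->", welcome_message(store, "c-1"))
```

Only the store class changes between the two runs. A real migration would also have to handle queries that span shards, which is where the re-architecting effort Stephens mentions goes.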
Among the tools Microsoft has developed to that end is one that recognizes that these terabytes per person aren’t really relational databases at all, but rather documents tagged by records that clog those databases. So in Windows Server 2008 R2, Justin Graham tells us, the company implemented File Classification Infrastructure as a way for data layers based on document retention to evolve sensibly.
“FCI allows administrators to apply classification rules to documents on file servers,” said Graham. “These classified files can then have actions taken against them based on their classification. The best part, these classifications are carried to SharePoint if the file moves.”
Some of the alternative approaches Stephens suggests are indeed quite radical, including a frame of mind he calls “NoSQL” — avoiding the use of a relational database in circumstances where tabular frameworks (employee ID / e-mail sent to customer / customer ID / document filename / document ID, send date…) are too binding. Just as inflexible business models stifle application scalability, Stephens believes unfathomable schemas stifle the scalability of data. And as big as data is becoming, moving it to the cloud becomes nothing more than relocation.
Leach points to the rise of new, relatively simple, non-relational, yet highly scalable database systems, such as Apache’s Cassandra project and the open source Redis project, as enabling businesses to deploy associative databases using simple key/value pairs (document ID -> document location). Both, he says, enable you to mix and match technologies so you don’t have to rely on a relational database. “Relational databases are very good at certain things,” Leach says, “But some things might be overkill, where you might need a simple key/value pair.”
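To make the key/value idea concrete, here is a minimal in-memory sketch of the document ID -> document location mapping mentioned above. It does not use the Cassandra or Redis client APIs. The class name, the IDs, and the paths are all hypothetical, and a plain Python dictionary stands in for the store.

```python
from typing import Dict, Optional


class DocumentLocator:
    """Toy key/value store: one key (document ID), one value (location).

    No schema, no joins; a dictionary stands in for a real key/value
    database, which would also add persistence and distribution.
    """

    def __init__(self) -> None:
        self._locations: Dict[str, str] = {}

    def set(self, document_id: str, location: str) -> None:
        self._locations[document_id] = location

    def get(self, document_id: str) -> Optional[str]:
        return self._locations.get(document_id)

    def delete(self, document_id: str) -> bool:
        """Remove a key; report whether it existed."""
        return self._locations.pop(document_id, None) is not None


locator = DocumentLocator()
# Hypothetical ID and path, for illustration only.
locator.set("doc-42", "/archive/contracts/doc-42.pdf")
print(locator.get("doc-42"))
print(locator.get("doc-99"))  # missing keys simply return None
```

The whole interface is three operations, which is the kind of case Leach has in mind when he says a relational database might be overkill.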
Scale by Leaps and Bounds
One of the most frequently cited modern scalability case studies involves the global messaging service Twitter. Twitter has undergone at least four complete architectural overhauls since its launch a few years ago, as systems designed for a few thousand simultaneous users suddenly found themselves servicing 350,000. Each of Twitter’s foundational components (especially the Web application framework Ruby on Rails) was blamed for what appeared at first to be scalability roadblocks.
However, in retrospect, it’s entirely feasible that if Twitter’s architects had designed its system from the beginning to undergo these same changes — if they were planned rather than unanticipated — they may very well have made the exact same architectural choices. Each choice may have been the right one, if only for a few months.
Perhaps, as Name.com’s Sean Leach advises, there’s a lesson to be learned from Twitter: Rather than planning to scale incrementally — which, if a business finds or regains success, may now be impossible — it should plan to rework its fundamental architecture as needed, in phases, as old architectures that met earlier requirements are no longer applicable.
The coefficients of Leach’s formula may sound a bit far-fetched, but in light of Twitter, perhaps the sky’s the limit. “Let’s say, the biggest you’re ever going to get is a trillion customers. But instead of designing for a trillion customers, design to a million customers so that when you get halfway there, you can redesign the system over time to be able to support a billion customers. Then when you get to almost a billion… just build it and get it out there, and then you spend your time up front and don’t worry about scaling. Plan in phases where you scale to X, and then when you get close to X, you start thinking about Y… as opposed to waiting until X happens before you worry about scaling,” he says.
Leach suggests this instead: “Take the time to sit down up front and ask, ‘What would we look like if we got really busy?’ and then plan to that. That’s Application Design 101: What should our hardware look like today, what will it look like in two years, and then what would we need to do to be able to make the system support what we look like in two years? That’s simple. You’d think that would be something everybody did, no matter what. But it’s not always the case.”
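Leach’s phased approach (design for X, and start planning for Y once you get close to X) can be expressed as a simple trigger rule. The sketch below assumes a “halfway there” trigger, taken loosely from his quote, and a tenfold jump between phases. Both parameters and both function names are illustrative choices, not anything Leach prescribes.

```python
def should_plan_next_phase(current_load: int, design_capacity: int,
                           trigger_ratio: float = 0.5) -> bool:
    """True once load reaches the assumed fraction of designed capacity."""
    return current_load >= design_capacity * trigger_ratio


def next_design_capacity(design_capacity: int, growth_factor: int = 10) -> int:
    """Capacity target for the next phase (the tenfold jump is an assumption)."""
    return design_capacity * growth_factor


# Hypothetical growth history: customer counts observed over time.
capacity = 1_000_000
for load in (200_000, 600_000, 5_500_000):
    if should_plan_next_phase(load, capacity):
        capacity = next_design_capacity(capacity)
    print(f"load={load:,}  designing for={capacity:,}")
```

The point of the rule is timing: the redesign starts while the current architecture still has headroom, rather than after X has already arrived.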
In light of the often daunting tasks that system re-architects face today, Bradford Stephens offers a frame of mind that he calls, “How to Make Life Suck Less.” It’s based on a simple concept: Failure will happen. Thus, plan for redundancy such that when components do fail, they get disconnected and maybe replaced, maybe not. But you still get some sleep. While that sounds like a dangerous camouflage for throwing hardware at the problem, at one level, it’s really not: Virtualization and the cloud make it feasible, and even affordable, to follow Leach’s milestones: to rescale by powers of ten rather than multiples of two.
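The “failure will happen” mindset can be sketched as a pool of redundant components in which a failed member is simply disconnected and the request moves on to the next one. This is a toy illustration with hypothetical names, not a description of Drawn to Scale’s platform. Real systems would add health checks, retries, and replacement of lost capacity.

```python
from typing import Callable, List


class ReplicaPool:
    """Routes requests to redundant replicas, dropping any that fail."""

    def __init__(self, replicas: List[Callable[[str], str]]) -> None:
        self._healthy = list(replicas)

    @property
    def healthy_count(self) -> int:
        return len(self._healthy)

    def call(self, request: str) -> str:
        # Try replicas in order; a failing replica is disconnected,
        # and the same request falls through to the next one.
        while self._healthy:
            replica = self._healthy[0]
            try:
                return replica(request)
            except Exception:
                self._healthy.pop(0)
        raise RuntimeError("no healthy replicas left")


# Hypothetical replicas: one that is down, one that works.
def flaky_replica(request: str) -> str:
    raise ConnectionError("replica is down")


def steady_replica(request: str) -> str:
    return f"handled {request}"


pool = ReplicaPool([flaky_replica, steady_replica])
print(pool.call("order-1001"))
print("healthy replicas remaining:", pool.healthy_count)
```

The caller never sees the first replica’s failure, which is the design goal: a component dies, gets disconnected, and nobody receives a 2 a.m. phone call.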
“We can sort of see the destination in the distance, and we see what we think the path is, but it may be kind of curvy,” Stephens warns. “We may not know what’s right around the bend… If you do it right, you’ll know because you won’t be getting the 2 a.m. phone calls, and deploying some buggy code won’t bring down your entire network. There will be a lot of roadblocks in the way, and there’s going to be a lot of emerging best practices. But there’s no set process that people go through when they say, ‘I need to scale my data infrastructure,’ or, ‘I need to evaluate a scalable data platform.’ We’re not there yet; and of course, we will be, because many, many people will have to tackle this problem. But it’s a transitional period.”
We end with the one conclusion that articles on topics of this scale should perhaps never leave the reader with: We don’t know the next steps on this road. This is uncharted territory. What we do know is that businesses can no longer afford to develop solutions around the edges of the core problem. They can’t just point to the symbols for their systems’ various ills (latency, load imbalance) and invest in the symbols that represent their solutions (scalability, the cloud, CMS platforms, social network platforms) as tools for postponing the inevitable rethinking of their business models. Throwing your business model at the cloud doesn’t make the symptoms go away; indeed, it magnifies them. Symbols aren’t solutions.
I’m reminded of when Charlie Brown’s friend Lucy famously found her first political cartoon printed in the newspaper. In it, she had meticulously devised an appropriate symbol for every one of the world’s problems. When she asked, hoping for some praise and admiration, whether he thought this cartoon would solve the world’s problems as she so earnestly intended, Charlie Brown responded, “No, I think it will add a few more to it.”