The hype and reality of the Big Data movement were on full display this week at Strata Conference in Santa Clara, California. With a sold-out show of 2,000+ attendees and 40+ sponsors, the conference was the epicenter of all things Hadoop and NoSQL, technologies that are increasingly gaining a foothold in corporate computing environments.
Most of the leading Hadoop distributions–Cloudera, Hortonworks, EMC Greenplum, and MapR–already count hundreds of customers. And it’s clear that Big Data has moved from the province of Internet and media companies with large Web properties to nearly every industry. Strata speakers described compelling Big Data applications in energy, pharmaceuticals, utilities, financial services, insurance, and government.
Even IBM, which is not considered a main tent player in the movement and did not exhibit at Strata Conference, has 200 customers using or testing its BigInsights Hadoop distribution, according to Anjul Bhambhri, vice president of Big Data at Big Blue. One IBM customer, Vestas Wind Systems, a leading wind turbine maker, uses BigInsights to model larger volumes of weather data so it can pinpoint the optimal placement of wind turbines. And a financial services customer uses BigInsights to improve the accuracy of its fraud models by addressing much larger volumes of transaction data.
Big Data Drivers
Hadoop clearly fills an unmet need in many organizations. Given its open source roots, Hadoop provides a more cost-effective way to analyze large volumes of data than traditional relational database management systems (RDBMSs). It’s also better suited to processing unstructured data, such as audio, video, or images, and semi-structured data, such as Web log data for tracking customer behavior on social media sites. For years, leading-edge companies have struggled to find an optimal way to analyze this type of data in traditional data warehousing environments, without much luck. (See “Let the Revolution Begin: Big Data Liberation Theology.”)
Finally, Hadoop is a load-and-go environment: administrators can dump the data into Hadoop without having to convert it into a particular structure. Then, users (or data scientists) can analyze the data using whatever tools they want, which today are typically languages, such as Java, Python, and Ruby. This type of data management paradigm appeals to application developers and analysts, who often feel straitjacketed by top-down, IT-driven architectures and SQL-based toolsets. (See “The New Analytical Ecosystem: Making Way for Big Data.”)
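To make the load-and-go idea concrete, here is a minimal sketch of the kind of analysis a Hadoop Streaming job might run over raw Web log files dumped into Hadoop as-is, with no upfront schema. The log layout, field positions, and function names are illustrative assumptions, not something from the article:

```python
# Page-hit counting over raw Apache-style access logs, written in the
# mapper/reducer style that Hadoop Streaming executes, e.g.:
#   hadoop jar hadoop-streaming.jar \
#       -input /logs/raw -output /logs/hits \
#       -mapper mapper.py -reducer reducer.py
from itertools import groupby

def mapper(lines):
    """Emit (page, 1) for each raw log line; no schema is imposed on load."""
    for line in lines:
        parts = line.split()
        if len(parts) > 6:          # skip malformed lines instead of failing
            yield parts[6], 1       # request path is the 7th field in common log format

def reducer(pairs):
    """Sum counts per page. Hadoop delivers pairs sorted by key,
    so grouping contiguous keys is sufficient."""
    for page, group in groupby(pairs, key=lambda kv: kv[0]):
        yield page, sum(n for _, n in group)

# Locally, Hadoop's shuffle/sort phase can be simulated with sorted():
logs = [
    '1.2.3.4 - - [10/Mar/2012:10:00:00 +0000] "GET /home HTTP/1.1" 200 512',
    '5.6.7.8 - - [10/Mar/2012:10:00:01 +0000] "GET /home HTTP/1.1" 200 512',
    '1.2.3.4 - - [10/Mar/2012:10:00:02 +0000] "GET /about HTTP/1.1" 200 128',
]
hits = dict(reducer(sorted(mapper(logs))))
```

On a real cluster, Hadoop would run the mapper over each file split in parallel, sort the intermediate pairs by key, and stream them into the reducer; the point is that the raw files were never converted into a database structure first.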
But Hadoop is not a data management panacea. It’s clearly at or near the apogee of its hype cycle right now, and its many warts will disillusion all but bleeding- and leading-edge adopters.
For starters, Hadoop is still wet behind the ears. The Apache Foundation just released the equivalent of version 1.0. So there are plenty of basic things missing from the environment–like security, a metadata catalog, data quality controls, backups, and monitoring and control. Moreover, it’s a batch processing environment that is not terribly efficient in the way it exploits a clustered environment. Hadoop knock-offs, like MapR, which embed proprietary technology underneath Hadoop APIs, claim up to five-fold faster performance on half as many nodes.
In addition, to actually run a Hadoop environment, you need to assemble software from a mishmash of Apache projects with razzle-dazzle names like Flume, Sqoop, Oozie, Pig, Hive, and ZooKeeper. These independent projects often contain competing functionality, have separate release schedules, and aren’t always tightly integrated. And each project evolves rapidly. That’s why there is a healthy market for Hadoop distributions that package these components into a coherent, implementable set of software.
But the biggest complaint among Big Data advocates is the current shortage of data scientists to build Hadoop applications. These “wunderkinds” combine a rare set of skills: statistics and math, data, process and domain knowledge, and computer programming. Unfortunately, most developers have little data or domain experience, and most data experts don’t know how to program, so the talent shortage is severe. Many companies are hiring four people with complementary skills to assemble one virtual data scientist.
One good thing about the Big Data movement is that it evolves fast. There are Apache projects to address most of the shortcomings of Hadoop. One promising project is Hive, which provides SQL-like access to Hadoop, although it’s stuck in a batch processing paradigm. Another is HBase, which overcomes Hadoop’s latency issues, but is designed for fast row-based reads/writes to support high performance transactional applications. Both create table-like structures on top of Hadoop files.
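The difference between Hive’s batch scans and HBase’s row-oriented access is easier to see in a toy model. The sketch below is an illustrative simplification in plain Python, not HBase’s actual API (the class and method names are invented): HBase keeps rows sorted by key on top of Hadoop’s files, which is what makes single-row reads and bounded range scans cheap, whereas a Hive query typically scans whole files in batch.

```python
from bisect import insort, bisect_left

class ToyHBaseTable:
    """Toy model of HBase's data layout: rows kept sorted by row key,
    each row a dict of column -> value. Real HBase adds column families,
    versioned cells, and distribution across region servers."""

    def __init__(self):
        self._keys = []   # sorted row keys
        self._rows = {}   # row key -> {column: value}

    def put(self, row_key, column, value):
        """Write a single cell; fast because it touches one row."""
        if row_key not in self._rows:
            insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Single-row read: direct lookup, no batch scan required."""
        return self._rows.get(row_key)

    def scan(self, start, stop):
        """Range scan over the sorted keys, like an HBase Scan
        bounded by start and stop row keys."""
        i = bisect_left(self._keys, start)
        while i < len(self._keys) and self._keys[i] < stop:
            yield self._keys[i], self._rows[self._keys[i]]
            i += 1

table = ToyHBaseTable()
table.put("user#1001", "cf:page", "/home")
table.put("user#1002", "cf:page", "/about")
table.put("user#2001", "cf:page", "/pricing")
```

Designing the row key well (here, a user-id prefix) is what turns a scan into a cheap, bounded read of adjacent rows rather than a pass over the whole table.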
In addition, many commercial vendors have jumped into the fray, marrying proprietary technology with open source software to turn Hadoop into a more corporate-friendly compute environment. Vendors, such as Zettaset, EMC Greenplum, and Oracle have launched appliances that embed Hadoop with commercial software to offer customers the best of both worlds. Many BI and data integration vendors now connect to Hadoop and can move data back and forth seamlessly. Some even create and run MapReduce jobs in Hadoop using their standard visual development environments.
Cooperation or Competition?
Although vendors are quick to jump on the Big Data bandwagon, there is some measure of desperation in the move. Established software vendors stand to lose significant revenue if Hadoop evolves without them and gains robust data management and analytical functionality that cannibalizes their existing products. They either need to generate sufficient revenue from new Big Data products or circumscribe Hadoop so that it plays a subservient role to their existing products. Most vendors are hedging their bets and playing both options, especially database vendors who perhaps have the most to lose.
In the spotlight of Strata Conference, both sides are playing nice and are eager to partner and work together. Hadoop vendors benefit as more applications run on Hadoop, including traditional BI, ETL, and DBMS products. And commercial vendors benefit if their existing tools have a new source of data to connect to and plumb. It’s a big new market whose sweet tasting honey attracts a hive full of bees.
Why Invest in Proprietary Tools?
But customers are already asking whether data warehouses and BI tools will eventually be folded into Hadoop environments or the reverse. Why spend millions of dollars on a new analytical RDBMS if you can do that processing without paying a dime in license costs using Hadoop? Why spend hundreds of thousands of dollars on data integration tools if your data scientists can turn Hadoop into a huge data staging and transformation layer? Why invest in traditional BI and reporting tools if your power users can exploit Hadoop with freely available languages and tools, such as Java, Python, Pig, Hive, or HBase?
The Future is Cloudy
Right now, it’s too early to divine the future of the Big Data movement and predict winners and losers. It’s possible that in the future all data management and analysis will run entirely on open source platforms and tools. But it’s just as likely that commercial vendors will co-opt (or outright buy) open source products and functionality and use them as pipelines to magnify sales of their commercial products.
More than likely, we’ll get a mélange of open source and commercial capabilities. After all, 30 years after the mainframe revolution, mainframes are still a mainstay at many corporations. In information technology, nothing ever dies; it just finds its niche in an evolutionary ecosystem.
Article source: http://www.b-eye-network.com/blogs/eckerson/archives/2012/03/the_hype_and_re.php