Yves, you recently announced Talend Open Studio for Big Data, and the world is really energized about big data. Could you tell us if this is Talend’s first step into the world of “big data”?
Yves de Montcheuil: It’s not our first step into the world of big data. Talend has been working with big data since before it was called big data. We have had connectivity for Hadoop. We have had the ability to load data, to process data within Hadoop for over two years. We’ve been partners with the early vendors in the Hadoop space – for example, Cloudera. We’ve been present at many of the Hadoop events in the past few years. So clearly, Talend is not a new entrant into the big data space. What we are doing with Talend Open Studio for Big Data is putting all of our big data capabilities into the same product. That’s, of course, data integration, which is obviously a key element, but also other features, such as data quality and cleansing, that apply to big data – the same kinds of rules and filters you would apply to conventional data.
When I look at the big data world, we hear a lot about Hadoop. What benefits does Talend Open Studio for Big Data bring to the users of Hadoop?
Yves de Montcheuil: The goal of Talend Open Studio for Big Data is to democratize the deployment of Hadoop to leverage big data. Talend was originally founded on the promise of democratizing integration, and we’ve been extremely successful at that, especially when it comes to integrating databases, applications, cloud, SaaS, etc. Hadoop introduces very high complexity into what you need to design in order to extract value – to extract information – out of that massive amount of data. Typically it would take something akin to a PhD in MapReduce, and I don’t think that PhD has been invented yet. What we are offering with Talend Open Studio for Big Data is the ability to very easily design big data integration and big data quality jobs, connect to sources, connect to targets, get data into Hadoop, and process data in Hadoop. In other words, not only integrate Hadoop with the rest of the enterprise IT stack – you might want to get data out of Oracle or Salesforce.com, and get the resulting data into Teradata or into QlikView – but also prepare the data, process it directly inside Hadoop. Don’t use Hadoop only as a place to store information, but also use it for what it is: an engine, an extremely powerful and scalable engine to process information. Again, without having to write the MapReduce code, we can abstract those transformations through our graphical interface with simple drag and drop of components, and the underlying code is generated automatically.
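To make the map/shuffle/reduce pattern concrete – the pattern that the generated code abstracts away – here is a minimal single-process sketch in pure Python. This is a hypothetical illustration of the paradigm, not Talend’s actual generated code; the record fields are invented for the example, and a real job would run distributed across a Hadoop cluster.

```python
from collections import defaultdict

def map_phase(records):
    """Map step: emit (key, value) pairs - here, one count per customer event."""
    for record in records:
        yield record["customer_id"], 1

def shuffle(pairs):
    """Shuffle step: group all values by key, as Hadoop does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce step: aggregate the grouped values - here, total events per customer."""
    return {key: sum(values) for key, values in groups.items()}

events = [
    {"customer_id": "c1", "action": "click"},
    {"customer_id": "c2", "action": "view"},
    {"customer_id": "c1", "action": "purchase"},
]
totals = reduce_phase(shuffle(map_phase(events)))
print(totals)  # {'c1': 2, 'c2': 1}
```

The point of a graphical tool is that the designer wires up components that correspond to these phases, and equivalent distributed code is emitted without anyone writing it by hand.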
Talend Open Studio for Big Data is fully integrated with the Apache Hadoop stack. It’s available under an Apache license, which makes it compatible at the license level with the Hadoop products.
We also announced recently a partnership with Hortonworks, one of the leading providers of Hadoop distributions – the Hortonworks Data Platform. Talend Open Studio for Big Data is now embedded into the Hortonworks Data Platform, and is clearly the reference tool for integrating, for moving and for transforming big data into Hadoop.
Talend’s roots are open source, and big data’s roots are in open source too. It would seem to me that you would have an edge over the competition that does not have your open source roots. Would you agree with that?
Yves de Montcheuil: It’s big data without the big bucks. You want to be able to get the benefits of big data without having to put millions of dollars on the table. And, frankly, a lot of companies have been doing big data for quite some time, but they have been doing it with conventional technologies. You know you can process massive amounts of data with, for example, a Teradata data warehouse, which is an extremely powerful technology, but also an expensive solution. Hadoop changes the game. It brings big data to the masses, and that’s thanks to the open source nature of Hadoop.
Beyond big data integration, is there a requirement for big data quality?
Yves de Montcheuil: There is absolutely a requirement for big data quality. If you are just processing and moving big data without introducing the quality dimension into it, you’re just shoveling heaps of garbage around. So what you want to do is cleanse and enrich the data the same way you would do it for small data or conventional data. I think today anybody who does business intelligence or data warehousing clearly understands the requirement of ensuring the quality of the data. The same holds true for big data except that it’s to the power of ten – at least! – because the data sets are much larger and more complex. If you don’t apply proper data quality, proper data hygiene, to your big data, you’re going to end up with a much bigger problem than what you would encounter in the conventional data world.
In order to do big data quality, one avenue that we are taking is to leverage Hadoop for CPU-intensive data quality functions. Features such as matching, deduplication, and linking of records can consume enormous amounts of resources. Because MapReduce is such a scalable architecture, we have taken the approach of generating Hadoop code in order to perform the data quality functions right inside of Hadoop.
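Deduplication maps naturally onto that architecture: records sharing a match key are shuffled to the same reducer, which keeps one survivor per group. The sketch below shows the idea in pure Python; the normalized name+email match key and the "most complete record wins" survivorship rule are assumptions for illustration, not Talend’s actual matching logic, which would use fuzzier comparison.

```python
from collections import defaultdict

def blocking_key(record):
    """Normalize fields into a match key; real matching would be fuzzier."""
    return (record["name"].strip().lower(), record["email"].strip().lower())

def dedupe(records):
    # "Map"/"shuffle": group records by their match key.
    groups = defaultdict(list)
    for r in records:
        groups[blocking_key(r)].append(r)
    # "Reduce": keep one survivor per group - here, the most complete record.
    return [
        max(group, key=lambda r: sum(1 for v in r.values() if v))
        for group in groups.values()
    ]

people = [
    {"name": "Ada Lovelace",   "email": "ada@example.com",   "phone": ""},
    {"name": " ada lovelace ", "email": "ADA@example.com",   "phone": "555-0100"},
    {"name": "Grace Hopper",   "email": "grace@example.com", "phone": ""},
]
deduped = dedupe(people)
print(len(deduped))  # 2
```

Because each group is processed independently, this is exactly the kind of CPU-intensive work that scales out well when the grouping and reducing run across a cluster instead of one machine.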
That’s an ingenious way to do it. They always say “Do it at the source,” and you can’t do it any closer to the source than by doing it in Hadoop.
Yves de Montcheuil: It’s doing it wherever it makes the most sense. It can be the source, it can be close to the target or it can be an intermediate engine. The key is to process the data where you have the ability to process it the best.
Good point. Can you share with us some use cases for big data integration?
Yves de Montcheuil: Some of our customers are doing very interesting things with big data. One that comes to mind is a telco company. They’ve been doing traditional data warehousing and business intelligence for a long time. In addition to those conventional sources that are coming from their customer applications, from their billing systems, etc., they are now doing sentiment analysis and monitoring social media – Twitter, Facebook – for people who are talking about them and their services. They are aggregating this information using Hadoop technology alongside a data warehouse that resides in one of the traditional data warehousing platforms. It’s really a very interesting use case where big data actually complements the traditional data warehouse that is in place.
Another use case we are encountering is when Hadoop is used essentially as an auxiliary ETL engine, but one that is on steroids. You have this big scale-out architecture that’s your Hadoop cluster, which gives you the ability to process very large amounts of data, aggregate it, and perform mathematical or statistical calculations on those records. By using Hadoop as the ETL engine that then loads the traditional data warehouse, some of our customers are actually decreasing the time it takes to process the raw data and load the data warehouse by two or three orders of magnitude. They are able to get much closer to real-time data warehousing than they were before.
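The "aggregate in Hadoop, then load the warehouse" pattern can be sketched as follows: raw events are reduced to compact summaries so the warehouse load handles far fewer rows. This is a single-process stand-in for a distributed job, and the event fields and summary shape are illustrative assumptions, not a specific customer’s schema.

```python
from collections import defaultdict
from datetime import date

# Raw, high-volume event data as it might land in Hadoop.
raw_events = [
    {"day": date(2012, 4, 2), "product": "widget", "amount": 10.0},
    {"day": date(2012, 4, 2), "product": "widget", "amount": 5.0},
    {"day": date(2012, 4, 3), "product": "gadget", "amount": 7.5},
]

# Aggregation step - the work offloaded to the scale-out cluster.
summary = defaultdict(float)
for e in raw_events:
    summary[(e["day"], e["product"])] += e["amount"]

# Rows actually loaded into the warehouse: one per (day, product)
# instead of one per raw event.
warehouse_rows = [
    {"day": d, "product": p, "revenue": total}
    for (d, p), total in sorted(summary.items())
]
print(len(warehouse_rows))  # 2
```

Shrinking millions of raw events down to pre-aggregated rows before the load is where the order-of-magnitude speedup in warehouse loading comes from.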
That’s excellent. Yves, thank you for bringing our readers up to speed on your big data initiatives.
Ron has an extensive technology background in business intelligence, analytics and data warehousing. In 2005, Ron founded the BeyeNETWORK, which was acquired by TechTarget in 2010. Now an associate publisher at TechTarget, Ron continues to lead the BeyeNETWORK, providing editorial direction and supporting the sales team. Prior to founding the BeyeNETWORK, Ron was cofounder, publisher and editorial director of DM Review (now Information Management). Ron also has a wealth of consulting expertise in business intelligence, business management and marketing.
Article source: http://www.b-eye-network.com/view/15911