Startups driving the first wave of big data infrastructure deployment have ‘landed’ in enterprise accounts and are centered on a limited number of well-defined use cases and data types. The second wave is unfolding now, as enterprises build critical business applications that span data silos across heterogeneous infrastructures.
Business drivers focus on self-service tools that help operations teams, data scientists and statisticians address performance, governance, compliance, and security of highly-sensitive data. Enterprises are now creating Chief Data Officer and Chief Data Ops roles to meet the challenges of data discovery, reliability, performance, access control, and life-cycle management.
Menlo co-hosted an event with Unravel and Waterline to discuss how web-scale companies like LinkedIn have internally built processes and controls for data management and operations, and how larger enterprises are planning to deal with these same issues. The panel was moderated by Nick Heudecker of Gartner and featured Daniel Sturman from Cloudera and Kapil Surlaker from LinkedIn.
After the event, we asked Kunal of Unravel and Alex of Waterline to weigh in on a few more questions.
KUNAL AGARWAL, CEO UNRAVEL
What are the common reasons that affect the performance of big data applications?
We see companies running a variety of mission-critical big data applications such as product usage reports, recommendation engines and pricing analytics. What we also see is that these applications perform poorly, are unreliable, or simply fail a majority of the time, which is not acceptable in a production environment.
A number of factors, spread all over the stack, affect the performance and reliability of big data applications. These include the quality of the code, configuration settings, data size and layout, resource availability and allocation, and the health of services. Pinpointing the root cause of a performance problem is therefore no easy task and requires visibility into every layer of the big data stack.
Who is responsible for solving these application-related problems in big data organizations?
Unlike the traditional RDBMS world, where roles and duties are nicely defined, the big data world is still in flux. In some companies the operations or platform teams solve these problems, while in others end-users are made responsible for their own applications or pipelines. In most organizations only a handful of people are truly capable of solving these problems, as it requires deep knowledge of Hadoop or Spark internals coupled with operations and administration experience. We have seen many big data teams in which engineers spend more than half of their day solving performance and reliability issues instead of being productive on big data systems.
How do the operations team or engineers go about solving big data application problems?
Today big data teams rely on infrastructure metrics and application logs. Infrastructure metric tools show graphs for utilization levels of CPU, disk, network, etc. These are very good for understanding system utilization but lack information about application behavior.
Engineers usually sift through raw application logs to uncover the cause and effect of problems. But this approach does not scale well. A Hadoop application, for example, can spawn hundreds of MapReduce jobs, each of which can have thousands of tasks. In this case the engineer would have to go through more than 100,000 log screens to diagnose the problem. Once they have narrowed down the problem, they then use trial-and-error techniques to find the optimal solution. It’s no wonder that it can sometimes take several days or even weeks to solve a single application problem.
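As a rough back-of-the-envelope check on that figure, the sketch below multiplies out the numbers. The specific values are assumptions chosen for illustration (the low end of "hundreds" and "thousands", and one log per task); they are not measurements from any real cluster.

```python
# Illustrative figures only: the text says "hundreds" of jobs and
# "thousands" of tasks, so these are assumed low-end round numbers.
JOBS_PER_APPLICATION = 100   # assumption
TASKS_PER_JOB = 1_000        # assumption
LOGS_PER_TASK = 1            # assumption: one log to inspect per task


def task_logs_to_review(jobs, tasks_per_job, logs_per_task=1):
    """Number of task logs an engineer would face for one application."""
    return jobs * tasks_per_job * logs_per_task


total = task_logs_to_review(JOBS_PER_APPLICATION, TASKS_PER_JOB, LOGS_PER_TASK)
print(f"{total:,} task logs to review")
```

Even with these low-end assumptions the count reaches 100,000, which is why reading logs by hand does not scale.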
ALEX GORELIK, WATERLINE DATA
Tools like Datameer, Platfora, and Tableau are great for analyzing and visualizing data, but if you have millions of files in a data lake, how does one find the right data to open in an end-user tool? Wouldn’t you need a data catalog?
Finding the right data in a lake of millions of files is like finding a needle in a stack of needles. Having a data catalog for the entire data lake enables a business analyst or data scientist to quickly zero in on the data they need without asking around or writing code. Data discovery tools enable the end-user to manually explore one file at a time, and they can populate their internal catalog as a by-product of the manual exploration, but they don’t create a full data catalog of every file in the data lake. The benefit of having a complete data catalog is that it automates 80% of the data discovery process and enables the business to answer its questions much more rapidly.
How would you build a data catalog in such a large lake? How can all the data be tagged quickly?
Given the volumes and ongoing changes to the data in the lake, it’s not practical to manually explore and tag every file and field in the data lake. A data lake makes it very easy to get data in because it doesn’t require a data model, but the flip side is that the business metadata still needs to be defined for the data to be understood and useful to business users. The challenge is that this has to be done as the data lands in the lake, which means that there is a need to automate how that’s done so the business can start using the lake within minutes or hours, not months.
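To make the idea of automated tagging concrete, here is a minimal sketch of one possible approach: sampling the values of each field as a file lands and suggesting a business tag when most values match a known pattern. This is an illustration only and does not describe how Waterline or any other product works; the tag names, patterns, threshold, and the `suggest_tags` and `tag_file` helpers are all hypothetical.

```python
import re

# Hypothetical tag rules: tag name -> pattern a field value must fully match.
TAG_RULES = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "us_zip_code": re.compile(r"\d{5}(-\d{4})?"),
    "iso_date": re.compile(r"\d{4}-\d{2}-\d{2}"),
}


def suggest_tags(values, min_match_ratio=0.8):
    """Suggest tags whose pattern matches enough of the non-empty values."""
    sample = [v.strip() for v in values if v and v.strip()]
    if not sample:
        return []
    tags = []
    for tag, pattern in TAG_RULES.items():
        matches = sum(1 for v in sample if pattern.fullmatch(v))
        if matches / len(sample) >= min_match_ratio:
            tags.append(tag)
    return tags


def tag_file(columns):
    """columns: dict mapping field name -> list of sampled string values."""
    return {name: suggest_tags(values) for name, values in columns.items()}


# A made-up file with uninformative field names, as often found in a lake.
landed_file = {
    "col_1": ["ann@example.com", "bob@example.org", "", "eve@example.net"],
    "col_2": ["94025", "10001-1234", "60614"],
    "col_3": ["2016-03-01", "2016-03-02", "n/a"],
}
print(tag_file(landed_file))
```

In this made-up sample, col_1 and col_2 receive a suggested tag, while col_3 falls below the assumed 80% threshold and is left untagged for a person to review. A real catalog would need far richer rules and human confirmation of the suggestions.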