May 9, 2014 | Bob Wilkinson

The Real Big Data Skillset Gap

I recently read an article by George Leopold (As CIOs Embrace Big Data, Cloud Will Soar) that caused me to ponder the recurring theme of a shortage of the skillsets needed to roll out Big Data Analytics to the Enterprise.  In the article, Mr. Leopold wrote on why Big Data and analytics could take longer to enter the corporate mainstream - “Part of the reason for the lag is the current lack of analytic skills within corporations”.  This echoed a slide from a 2014 Gartner webcast (Hadoop 2.0 Signals Time for Serious Big Data Consideration) co-presented by Merv Adrian and Nick Heudecker.  In that webinar, Mr. Heudecker cited a statistic that by 2015 there would be 4.4 million Big Data jobs, only a third of which would be filled.  It doesn’t stop there - Google the term “big data skill set shortage” and you will come up with a plethora of other positions on the subject.

What are we, as industry practitioners, supposed to make of all this?  I don’t know about you, but the thought of these millions of missing Big Data Analysts conjures up mental images of monkeys on keyboards (for those of you not familiar, the “infinite monkey theorem” states that monkeys hitting keys randomly for an infinite amount of time will almost surely type a given text, such as the complete works of Shakespeare).  In this context - if we just had an infinite number of monkeys staring at the latest/greatest visualization/BI software we could surely capture all of the Big Data insight there was to get - right?

Well, needless to say, I think the answer is no, and I’m not convinced that conventional industry wisdom has it right here.  Let’s step back from the skillset question a moment and ask two more important questions – 1) do we have the right Big Data tools, and 2) are we asking the right Big Data questions?  While there are obviously Big Data success stories out there, the honest answer to both at the industry level is ‘no’, and if that is the case we have to be careful about the lens through which we evaluate perceived skillset shortages.

Let’s tackle both of those questions with a slant towards their ramifications for a skillset shortage.  First, on the topic of tools, the early days of Big Data were dominated by the notion that everything was “new” – it was all about unstructured data, Hadoop, NoSQL, etc., each of which concretely identified a new skillset (and thus a potential gap).  I think the industry has come a long way in some respects but still has a long way to go in others.  On the one hand, the industry has clearly rallied around SQL as an API for reporting and analytics.  Tools like InfiniDB (and Hive, Impala, Presto, etc.) are all, to different degrees, enabling SQL access over data stored in Hadoop.  This is huge – it means existing SQL tools and skillsets translate easily to the new Big Data world.  Where I think the industry needs significant progress, however, is standardization in and around NoSQL.  The knee-jerk reaction by most is that “of course there are no standards” – its name literally means not SQL after all.  This is a cop-out response – it obviously isn’t an easy problem, but it’s a necessary step that would signify a needed level of maturity in the NoSQL space and help to mitigate any related skillset shortages as developers struggle to build and maintain solutions on non-standardized technologies.

Second, another area I think the industry is getting right is the realization that Big Data is all about choosing the right tool for the right job - long gone are the days of the one-size-fits-all database.  Accepting this reality, though, means another new challenge - if we have all these different kinds of data engines in the enterprise, how do we efficiently expose all of these different data sources?  This goes well beyond the “1.0” definition of query federation, which is focused on simple bridging of the SQL and NoSQL worlds.  Here I see innovators like Cirro stepping up to take on the challenge.  Their Data Hub product offers not only seamless SQL/NoSQL access, but also embeds the intelligence necessary to interact with these different engines in an efficient manner – things like cost-based optimization, smart caching, dynamic query plan re-optimization, etc.  Done right, this “2.0” version of federation is going to enable entirely different questions (the “right” questions) to be asked of the data and answered efficiently (and without requiring hordes of IT staff, engineers, etc. to integrate all these systems).
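
To make the “2.0” idea a little more concrete, here is a toy sketch in Python of two of the ingredients mentioned above: picking the cheapest engine for a query and caching the result.  To be clear, this is not how Cirro’s Data Hub (or any real product) works internally; the `Engine` and `Federator` classes, the cost formula, and the engine names are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Row = Tuple


@dataclass
class Engine:
    """Hypothetical data engine: a name, a crude cost model, and a query function."""
    name: str
    cost_per_row: float          # assumed relative cost of scanning one row
    row_count: int               # assumed number of rows the engine must scan
    run: Callable[[str], List[Row]]

    def estimated_cost(self) -> float:
        # Deliberately naive cost model: scan cost grows linearly with row count.
        return self.cost_per_row * self.row_count


@dataclass
class Federator:
    """Routes each query to the cheapest engine holding the dataset and caches results."""
    engines: Dict[str, List[Engine]]   # dataset name -> engines holding a copy
    cache: Dict[Tuple[str, str], List[Row]] = field(default_factory=dict)

    def query(self, dataset: str, text: str) -> Tuple[str, List[Row]]:
        key = (dataset, text)
        if key in self.cache:
            # Identical query seen before: answer from the cache, touch no engine.
            return "cache", self.cache[key]
        # Cost-based choice: pick the engine with the lowest estimated cost.
        engine = min(self.engines[dataset], key=Engine.estimated_cost)
        rows = engine.run(text)
        self.cache[key] = rows
        return engine.name, rows
```

With two made-up engines holding the same dataset, the first call is routed to whichever one the cost model rates cheaper, and a repeat of the same query is served from the cache.  A real federation layer would also have to deal with cache invalidation, joins across engines, and re-planning mid-query, none of which this sketch attempts.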

So – where others see a skillset gap, I see the need and opportunity for better tools.  What else would you expect from a software engineer and Big Data tool vendor? :) Obviously, there is another stack layer we still need to address – BI and visualization.  It’s another interesting and active area with ramifications for skillset gaps, but I will save it for a future post.

What do you think about the Big Data Skillset Gap?  Do you agree with my analysis and call to action?  I’d love to hear your opinions on the matter.  Please feel free to reach out to me on Twitter @bobwilkinson20.