
What is Hadoop and What Does it Mean for Supply Chain Management?


I've been hearing the term "Hadoop" everywhere.  I asked my colleague David White, ARC's analytics expert, what it was and whether it was something that supply chain executives needed to know about.

David explained that Hadoop was probably being most widely used in marketing, because it is best suited to unstructured and semi-structured data: images, social media data, call center transcripts, and clickstream data, for example.  So marketers use Hadoop to improve their understanding of customers and prospects, and their ability to sell them the right product, at the right time, using the right channel.

But as companies strive to become demand driven, the marketing and supply chain functions need to become more tightly integrated.  For example, leading companies are exploring how to use social media and internet traffic to forecast demand for new product introductions, and omni-channel retailers are seeking to use this data to understand a promotion’s potential lift by channel.

Unstructured data is also potentially useful for emerging supply chain risk management applications, for building a better understanding of the supply chain programs at leading competitors, and for recruiting and retaining supply chain talent.

David explained to me that discussing Hadoop can be tricky because it’s a bit like the blind men and the elephant.  Hadoop is lots of things.  Depending on which way you want to look at it, Hadoop is:

  • A distributed data management platform – really a cut-down distributed operating system.  It is designed to manage and work with immense volumes of data, and to scale linearly from just a few to thousands of commodity computers.  In its earliest incarnation, it consisted of three parts: one for data management, one for programming, and one to make it all hang together.  These are the Hadoop Distributed File System (HDFS), MapReduce, and Hadoop Common, respectively.  (A minimal MapReduce example follows this list.)
  • Open source.  Hadoop originated at Yahoo in 2005 as the infrastructure to support a web search project.  Since then, Hadoop has migrated over to the Apache Software Foundation (“Apache”).  As such, it is available for anyone to download and use, free of charge.
  • An ecosystem.  Like many open source projects, Hadoop has spawned a diverse and evolving ecosystem of enhancements, add-ons, and alternatives.  Just to name a few, these include Pig, Hive, YARN, ZooKeeper, and Avro.  The ecosystem also includes commercial vendors that provide value-added services based on Hadoop.
  • Difficult.  Hadoop is really a software project, not a software product.  As noted, you can download it free of charge.  But unless you have fairly rare technical skills (or plenty of time on your hands), implementing, scaling and supporting that distribution can be a bit of a challenge.  Consequently, a number of companies now provide a more polished software distribution and supporting services.  Hadoop is available as a managed service too.
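To make the MapReduce piece a little more concrete, below is the classic word-count job, essentially the "hello world" of Hadoop, sketched against Hadoop's standard Java MapReduce API.  It counts how often each word appears in a set of files stored in HDFS; the map tasks run in parallel across the cluster and the reduce tasks sum the per-word counts.  Treat it as an illustrative sketch rather than production code, and note that the input and output paths are simply whatever gets passed on the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // The map step runs in parallel across the cluster, one task per split
      // of the input stored in HDFS; it emits a (word, 1) pair for every word.
      public static class TokenizerMapper
           extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

      // The reduce step gathers all the counts for a given word and sums them.
      public static class IntSumReducer
           extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output are HDFS paths supplied on the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Once packaged into a jar, a job like this is typically submitted with the "hadoop jar" command, and the results land back in HDFS as a set of output files.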

Putting those definitions and technobabble to one side, it’s always important in the technology game to follow the money:

  • Commercial Hadoop startups such as Cloudera, Hortonworks and MapR have recently scored massive venture capital investment.  Cloudera closed a $900m round of funding in June.  Not to be outdone, Hortonworks announced a $100m funding round in March, with an additional $50m investment in June.  Likewise, MapR raised $110m in June, with Google Capital leading that round of investment.
  • Large mature enterprise IT vendors such as HP, Intel and IBM are backing Hadoop too.  HP invested $50m in Hortonworks in June (see above) to drive closer integration between Hadoop and HP’s other big data technologies.  For its part, Intel was part of Cloudera’s recent $900m financing round, owns 18% of Cloudera, and has a seat on the board too.  IBM has its very own Hadoop distribution and also offers Hadoop in the cloud.

So your IT department really shouldn’t be on the fence about Hadoop because it’s a given.  It’s a done deal.  It’s going to happen.  Hadoop has so much momentum at the moment that it’s hard to see an alternative data management infrastructure emerging in the foreseeable future.  Almost anyone who wants to manage massive amounts of unstructured (or semi-structured) data will have Hadoop.  So, instead of wondering what Hadoop is and whether it’ll be part of your future, get ahead of the game and ponder three more important questions instead:

  1. What’s the best Hadoop approach for my company?  There are three main approaches, each with a different trade-off between cost and the technical skills required: downloading the free distribution from Apache demands intensive and ongoing technical skills; using a commercial distribution reduces the skills burden; and pursuing the Hadoop-as-a-Service approach minimizes the technical skills needed.
  2. What analytic infrastructure are we going to use on top of Hadoop?  Hadoop is just a data management platform, a cut-down operating system.  By itself, it adds little value to an enterprise.  In earlier IT generations, relational databases breathed life into the Unix operating system, and productivity applications made Microsoft Windows pre-eminent.  In the same vein, choosing the right analytic database and toolset for Hadoop is more important than Hadoop itself.  (A brief sketch of what that can look like, using Apache Hive, follows this list.)

  3. What are our supply chain application vendors planning to do with Hadoop?  Planning further ahead, it behooves supply chain managers to start asking pointed questions of their favorite supply chain application vendors: what hooks are they providing to integrate with Hadoop databases, and what plans do they have to incorporate Hadoop as part of the supporting technology behind their own applications?
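To give a feel for what an analytic layer on top of Hadoop can look like in practice (question 2 above), here is a small, hypothetical sketch using Apache Hive, one of the ecosystem projects mentioned earlier.  Hive lets analysts run SQL-like queries over data sitting in HDFS and turns them into jobs on the cluster, and it exposes a standard JDBC interface, so a query like the one below could just as easily come from a reporting or BI tool.  The host name, table and columns are invented for illustration.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class PromotionLiftByChannel {
      public static void main(String[] args) throws Exception {
        // Hive's JDBC driver; HiveServer2 listens on port 10000 by default.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // The host, database, table, and columns below are hypothetical.
        String url = "jdbc:hive2://hadoop-gateway.example.com:10000/default";
        String sql = "SELECT channel, COUNT(*) AS visits "
                   + "FROM clickstream "
                   + "WHERE event_date >= '2014-06-01' "
                   + "GROUP BY channel";

        try (Connection conn = DriverManager.getConnection(url, "analyst", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
          // Hive compiles the query into work that runs across the Hadoop cluster.
          while (rs.next()) {
            System.out.println(rs.getString("channel") + "\t" + rs.getLong("visits"));
          }
        }
      }
    }

The point is less the code itself than the pattern: the heavy lifting happens inside the Hadoop cluster, while familiar SQL-style tools and applications sit on top of it.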