Technology options in the big data space often create confusion and concern.
The confusion arises because there are too many options; there are no one or two monolithic platforms that provide a solution to every possible problem.
The concern arises because the space is heavily influenced by open-source software and platforms, and equally by third-party product vendors who don’t have a long track record in the space.
In essence, this is both a ‘problem of too many’ and a ‘problem of very few’!
Let’s understand these dynamics in detail.
Problem of too many
As you explore big data technology options, you’d soon find that there is always more than one way to do things, and more than one technology to do them with.
Example: data integration from one or more data sources.
If you have an existing data integration tool/platform such as Informatica or Ab Initio, you could check whether it can connect to and process data from your data sources.
If it can, then the question is whether it can handle all the types of data sources in play.
If the answer to that is no, then you’d have to understand the data sources’ characteristics (data type, frequency of arrival, quality, treatment needed, etc.) and pick a tool accordingly (Kafka, Sqoop, Flume, Spark Streaming, or something custom built).
In this example, we haven’t even talked about performance considerations yet; we are only trying to make a first-cut choice.
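To make the first-cut choice above a little more concrete, here is a minimal sketch in Python. The field names, the source categories and the rules themselves are my own assumptions for illustration; they only reflect what each tool is commonly used for (Sqoop for bulk transfer from relational databases, Flume for collecting logs, Kafka for moving event streams, Spark Streaming for processing streams), and a real selection would weigh many more factors.

```python
from dataclasses import dataclass


@dataclass
class DataSource:
    """Hypothetical description of a data source (all fields are assumptions)."""
    name: str
    kind: str                       # "rdbms", "log", "event", or anything else
    arrival: str                    # "batch" or "continuous"
    needs_processing: bool = False  # does the data need treatment in flight?


def first_cut_tool(src: DataSource) -> str:
    """Return a first-cut tool suggestion; performance is deliberately ignored."""
    if src.kind == "rdbms" and src.arrival == "batch":
        return "Sqoop"
    if src.kind == "log" and src.arrival == "continuous":
        return "Flume"
    if src.arrival == "continuous":
        # Streams that need treatment in flight go to a processing engine,
        # the rest only need to be transported.
        return "Spark Streaming" if src.needs_processing else "Kafka"
    # Nothing matched: fall back to a custom-built integration.
    return "Custom built"


if __name__ == "__main__":
    sources = [
        DataSource("orders_db", "rdbms", "batch"),
        DataSource("web_logs", "log", "continuous"),
        DataSource("clickstream", "event", "continuous", needs_processing=True),
        DataSource("partner_files", "file", "batch"),
    ]
    for src in sources:
        print(src.name, "->", first_cut_tool(src))
```

Even this toy version shows the point: the answer depends on the characteristics of each source, and several tools usually end up in the mix.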
The number and variety of big data platforms is, to some extent, comparable to the number and variety of database technologies.
Example: in the legacy world, one would look at database options such as Oracle, Microsoft SQL Server, IBM DB2, or MySQL.
Similarly, given their prominence and market adoption, in the big data world you would come across names such as Cloudera, Hortonworks, IBM BigInsights, MapR, etc.
Each of these big data companies has its own product vision and roadmap, which it presents as choices to its customers, while working very closely with the underlying open-source communities. I will cover big data platforms separately.
Problem of very few
Platform choices are made on a set of criteria such as:
- History of the company
- Product roadmap and development over years
- Market adoption and general opinions
- Periodic reviews and guidance from market and industry analysts
- Financial stability and viability
- Sales, Support and Services options
- Licensing models and Price options
You would find that there are few big data companies that have a long history and presence and are publicly listed!
You’d find most of them to be private companies, in mature start-up mode, backed by prominent and stable venture capitalists.
However, you’d be amazed to see how widespread the market adoption of their products is, across industry segments and across the world!
This state of affairs puzzles decision makers on a number of levels.
What reference points should be considered to pick a technology vendor or two?
Should we use the same measures with which we picked our legacy tools and technologies, or should we frame new methods for the new world?
Should we plunk down a lot of cash on these newer products today, or should we wait and watch for those products, and in turn the companies supporting them, to mature?
How do such decisions play out on licensing and product rollout cost structures, and are there any other related hidden costs (tangible and otherwise)?
None of the points above should scare one away; rather, they should help one realize that the Big Data ecosystem needs to be seen through a new type of lens!