Big Data and Power of Hadoop

Kartik Dave, Data Science, FedEx Services

The pace of IT has become so fast that traditional infrastructure can no longer support modern demands for data. Emerging technologies such as Hadoop and its companion projects are rising to the occasion: they are the silver lining in the cloud, helping companies overcome the barriers of cost and speed by driving innovation and delivering on the volume, velocity, variety and veracity of data.


Consumer goods, supply chain, financial services, bio-pharmaceuticals, network security, social media, food and many other industry verticals have one theme in common: data. Data is ubiquitous. “Until 2005, 1.3 billion RFIDs were instrumented” into the things that we buy, sell, use and discard, from grocery stores and online shopping to the places where we live, the products we consume and the information transmitted through social media platforms. By 2011, when IBM did its survey, the number of RFIDs instrumented into our daily lives had gone up to 30 billion. One might ask what caused this sudden growth in digitization. The answer is nothing but “Big Data.”

Big Data

“Big Data is any attribute that challenges constraints of a system capability or business need.” With new ways of doing business emerging, data is not only ubiquitous and abundant; the mechanisms through which it gets created have become spontaneous, instantaneous, constantly changing and instrumented. For instance, an Airbus A380 spooled 1 billion lines of code, with each engine generating 10 TB every 30 minutes, on a flight from Heathrow Airport in the United Kingdom to JFK in New York City. This gives us a glimpse of the voluminous nature of the data world that we live in. Analysts, flight engineers, safety specialists and innovators analyze this type of data to build better mathematical or analytical models that help them make faster decisions.

Attributes of Big Data

“Big Data is data about data.” The underlying difference between big data and traditional data lies in big data’s characteristics: Volume, Velocity, Variety and Veracity.
1. VOLUME - what seems to be the greatest challenge for companies working with traditional data is the biggest opportunity for companies that work with big data. For instance, Twitter generates 12+ TB of data, Facebook analyzes 25 TB of log data and the NYSE spools 1 billion TB on a given trading day. The capacity needed to store this data has grown quickly from terabytes to petabytes to zettabytes.
2. VELOCITY - the rate at which information flows, mainly through manual input on social media, location sharing through maps, etc. Data compounds annually, and companies that still follow the traditional batch-mode approach struggle to process it in real time and compete analytically.
3. VARIETY - traditionally, data has been a “tabular, well-defined, collection of block entries”: structured and constructed in the form of documents, financial information, stock records, personnel files, etc. Big data, by contrast, is known for its diversity and interconnectedness, generated by automated and manual input through smart devices in the form of photographs, audio/visual files, location, 3D models, simulations, etc.
This augmented new reality of the big data world has opened the floodgates to derive insights, draw relationships (connect the dots), search for precise action, and generate more profits and savings for customers.
4. VERACITY - uncertainty in data leads to poorer and slower decision making. For instance, according to an infographic from IBM, 1 in 3 business leaders don’t trust the information they use to make decisions. Chris Barnatt, in a video on the topic, says the U.S. government estimates it could save $300 billion a year by harnessing the power of big data.

Big Data - Demystified

Big data poses a quandary and an opportunity at the same time. It is a quandary because big data has proliferated into our lives at such speed that companies wanting to make sense of this ubiquity are perplexed by its economies of scale, agility, and iterative nature.

Business-savvy people saw an emerging trend of monetizing data through the continuous interconnectedness and instrumentation of the objects available to people. For instance, companies like Google, Twitter, and Facebook can be considered catalysts or leaders in triggering the big data phenomenon, simply by awakening and connecting the world and making the web a common commodity of information to search, buy, sell, analyze and decompose. Traditionally, data was perceived only in the form of financial transactions, human resources, personnel files, bank data, etc.
However, with the advent of big data, there is a big opportunity to monetize data from photographs, videos, status updates, etc. Pinterest, a website where people connect and share their hobbies and interests, has helped advertisers maximize revenue and increase sales of the items shared there. Another example is Taylor Swift, a pop star who built on her singing talent by posting YouTube videos, gaining the mass popularity that helped her become a millionaire.
Big data matters more now than ever before because monetization can come not just from financial transactions, where banks offer a variety of products to consumers, but also from other channels such as social media, e-commerce, consumer products and retail, and companies can gain a competitive edge by harnessing the power of big data.

Harness Power of Big Data

One might ask: if big data is so important, why is not everybody taking advantage of it? It is possible, but large companies and governments are coming up to speed slowly but steadily as they learn the various methods of taking advantage of data.
The secret lies in understanding clustered architecture. Companies have used enterprise architecture consisting of servers, storage, and SANs (storage area networks) to store traditional data. This works well for business financial reporting, human resource planning, recruiting, supply chain, etc., where traditional data is structured, cube-like and tabular in nature. However, “sorting this data is not only challenging but also like finding a needle in a haystack.”
To put the scale in perspective, the amount of data searched, merged and analyzed by Google in 2008 was 20 petabytes a day.
Being able to sort and shuffle through large arrays of datasets is a big deal for product companies such as Google, Pinterest, Dropbox, Foursquare, Facebook, and so on, simply because 25% of CPU cycle time is spent sorting. This gives a fresh perspective on economies of scale, cost, and volume: beyond a certain point, data would be unmanageable and delivery with speed would be a big concern.
The solution lies in understanding the basic principles of clustered architecture, which can help companies small, medium and large move from costly enterprise hardware to clusters of commodity hardware, where cost and speed intersect with the economies of the cloud.
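To make the idea of sorting on a cluster concrete, here is a minimal Python sketch of one common approach, range-partitioned sorting. It is an illustration under stated assumptions, not the method used by any company named above: the function names (choose_splitters, partition, cluster_sort) are hypothetical, and threads stand in for the separate machines a real cluster would use.

```python
import bisect
import random
from concurrent.futures import ThreadPoolExecutor


def choose_splitters(records, num_nodes, sample_size=100, seed=0):
    """Pick num_nodes - 1 boundary keys from a random sample of the data."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(records, min(sample_size, len(records))))
    # Evenly spaced sample elements approximate evenly sized key ranges.
    return [sample[(i * len(sample)) // num_nodes] for i in range(1, num_nodes)]


def partition(records, splitters):
    """Route every record to the bucket (node) that owns its key range."""
    buckets = [[] for _ in range(len(splitters) + 1)]
    for record in records:
        buckets[bisect.bisect_right(splitters, record)].append(record)
    return buckets


def cluster_sort(records, num_nodes=4):
    """Partition by key range, sort each range independently, then concatenate."""
    if not records:
        return []
    splitters = choose_splitters(records, num_nodes)
    buckets = partition(records, splitters)
    # Assumption: threads stand in for cluster nodes; each sorts only its bucket.
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        sorted_buckets = list(pool.map(sorted, buckets))
    # Every key in bucket i is smaller than every key in bucket i + 1,
    # so joining the buckets in order yields a fully sorted result.
    return [record for bucket in sorted_buckets for record in bucket]
```

Calling cluster_sort(data, num_nodes=8) on a list of numbers returns the same result as sorted(data). The point of the design is that no single node ever has to hold or sort the whole dataset, so adding nodes shrinks each node's share of the work.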

Interesting Facts

• The Sort Benchmark reports that the University of Padova in Italy established a Penny Sort record by sorting 344 GB of data for 1 cent! What, you might ask?
• The University of California San Diego set a Gray Sort record by sorting 1353 GB of data in a minute! How, might be your question?
• Another record by UC San Diego was sorting 1 TB of data in a minute! Clustered architecture is your answer.
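A little arithmetic shows why these records point to clustered architecture. The short Python sketch below converts the figures quoted above into sustained throughput and compares them with a single disk. The 100 MB/s disk speed is a hypothetical round number chosen for illustration, and 1 TB is taken as 1000 GB; neither comes from the benchmark results themselves.

```python
def throughput_gb_per_s(gigabytes, seconds):
    """Average rate needed to process `gigabytes` in `seconds`."""
    return gigabytes / seconds


def single_disk_read_minutes(gigabytes, disk_mb_per_s):
    """Minutes one disk would need just to read the data once."""
    seconds = gigabytes * 1000 / disk_mb_per_s  # 1 GB = 1000 MB (decimal units)
    return seconds / 60


# Figures quoted in the list above.
record_1353 = throughput_gb_per_s(1353, 60)  # 1353 GB sorted in one minute
record_1tb = throughput_gb_per_s(1000, 60)   # 1 TB sorted in one minute

# Assumption: a single disk that reads at a hypothetical 100 MB/s.
one_disk = single_disk_read_minutes(1353, 100)

print(f"1353 GB in a minute: {record_1353:.2f} GB/s")
print(f"1 TB in a minute: {record_1tb:.2f} GB/s")
print(f"One 100 MB/s disk reading 1353 GB: {one_disk:.1f} minutes")
```

Under these assumptions, one disk would spend hours merely reading data that the record run both read and sorted in a minute. Rates above 20 GB/s are only reachable by spreading the work across many disks and machines working in parallel.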
