Unit 3 Individual Project
Handling Big Data for Hadoop
ThienSi Le
Colorado Technical University
CS882-1603C-01
Professor: Dr. Elizabeth Peralez
13-August-2016
Handling Big Data for Hadoop
A. Introduction
With new data computing, automation, and Web technologies in the competitive, data-driven market and Internet-based economy, data have become cheap to store, fast to process, and ubiquitous in both the public and private sectors. Big Data, a generic term for data characterized by the 5 V's (massive Volume, Variety of forms, high Velocity of processing, Veracity, and Value), poses major challenges to many organizations trying to extract or transform it into insightful information (Gartner, 2013). Handling Big Data with a traditional approach such as centralized database servers or RDBMSs (Relational Database Management Systems) with Microsoft Excel, Access, etc. no longer works. To analyze Big Data in colossal volumes, many organizations such as Amazon, Apple, Google, IBM, Intel, and Microsoft develop their own high-tech statistical tools or use advanced analytical tools. Some typical analytical tools are AWS (Amazon Web Services), Tableau, R, Apache Spark, VIDI, Quantum GIS, Google Fusion Tables, etc. (Machlis, 2011).
Big Data Analytics (BDA) is a systematic process of evaluating large amounts of data of varying types in order to identify patterns, relationships, unknown correlations, and other useful information. Today, BDA is applied in many fields: healthcare, automotive, presidential campaigns, highway traffic, insurance, banking, social networking, law enforcement, etc. (Natarajan, 2012). One of the analytical tools in popular demand for Big Data Analytics is Hadoop. This report consists of three parts that highlight the Hadoop ecosystem, discuss data in the context of Big Data, and describe how data are prepared for the Hadoop system:
- Part I briefly describes the Hadoop ecosystem.
- Part II examines Big Data, Big Data technologies, challenges, benefits, etc.
- Part III discusses handling Big Data, including manageable data sizes, when to use Hadoop, and how Big Data can be prepared before Hadoop processes them.
B. Part I: Hadoop Ecosystem
The Hadoop project is an Apache open source framework, written in Java, that allows parallel processing of large datasets such as Big Data across scalable, distributed clusters of servers and computers, providing both distributed storage and computation (Apache Software Foundation, 2014). Apache Hadoop has emerged as the de facto standard for managing large volumes of unstructured data (Intel, 2015). The Hadoop framework consists of four primary components:
1. Hadoop Common: This module contains the Java libraries and utilities, such as the filesystem and OS-level abstractions, that support the other Hadoop components.
2. Hadoop YARN: This module handles job scheduling and cluster resource management.
3. Hadoop Distributed File System (HDFS): The HDFS is a distributed file system that provides high-throughput access to application data. Refer to Figure 1 for more information.
4. Hadoop MapReduce: This YARN-based module performs parallel processing of large data sets.
Figure 1: A Simplified Hadoop Architecture. Source: Adapted from Apache Software Foundation, 2012.
Hadoop addresses the problem of huge amounts of data with the MapReduce algorithm, which Google originally developed. Hadoop breaks a big file or large data set into many data chunks and distributes them to multiple servers, the data nodes in the HDFS. Each data chunk typically stores 128 MB of data. The top cluster node, or name node, manages the file system metadata, while the data nodes are inexpensive commodity servers that store the data chunks, as shown in Figure 2. When a client executes a query, it obtains the file metadata from the name node and then reads the actual data blocks from the multiple data nodes. Hadoop provides a command-line interface for administrators to work on HDFS, and the name node runs a built-in Web server that lets users browse the HDFS file system and view basic statistics. The results from the individual data nodes are collected to form the final result dataset (Apache Software Foundation, 2014).
Figure 2: HDFS simplified block diagram. Source: Adapted from Borthakur (Apache Hadoop Organization, 2012).
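To make the MapReduce idea concrete, the sketch below shows a minimal word-count job written for Hadoop Streaming in Python. It is an illustration of the split-map-shuffle-reduce flow under stated assumptions, not a production job; the script name and the way it is invoked are placeholders.

```python
#!/usr/bin/env python3
# wordcount_streaming.py - one script usable as both the mapper and the
# reducer of a Hadoop Streaming word-count job.
#   mapper : emits "word<TAB>1" for every word read from stdin
#   reducer: sums the counts per word (input arrives sorted by key)
import sys


def mapper():
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")


def reducer():
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").rsplit("\t", 1)
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, 0
        current_count += int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")


if __name__ == "__main__":
    # "map" or "reduce" is chosen when the streaming job launches the script.
    mapper() if sys.argv[1] == "map" else reducer()
```

A job like this is typically submitted with the hadoop-streaming jar, passing the script as the -mapper and -reducer commands; the exact jar path and options depend on the installation.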
Figure 3 shows how regular data are processed in a traditional data warehouse alongside the processing of Big Data for data science in an ETL process. Source: Adapted from Intel, 2015.
CRM: Customer Relationship Management
ERP: Enterprise Resource Planning
ETL: Extract, Transform, and Load process
Sqoop, Flume, ODBC/JDBC: data ingestion and connectivity tools for Hadoop
C. Part II: Big Data
With the advent of new technologies, devices, sensors, social networks, communications, etc., the amount of data generated by people and organizations grows rapidly every day. Up to 2003, roughly 5 billion gigabytes had been generated in total; by 2011 the same amount of data was generated every two days, and by 2013 every ten minutes (TutorialsPoint, 2016). Big Data refers to such collections of massive datasets, which can only be processed with advanced analytical and statistical tools.
1. Sources of data
Big Data are generated by many different devices and applications:
- Black box data: Black boxes on airplanes, helicopters, jets, etc. record the voices of the flight crew and audio from microphones and earphones.
- Social media data: Social media networks such as Facebook, Twitter, LinkedIn, etc. hold streaming data, online conversations, and posts from millions of users.
- Stock exchange data: Stock exchange data hold information on share trades in stock markets.
- Power grid data: Power grid data hold information from consumers and power stations.
- Transport data: Transport data include information about a vehicle, such as its model, capacity, etc.
- Search engine data: Search engines such as Google, Yahoo, etc. retrieve large amounts of data and information from various databases.
2. Data types
Big Data can be characterized into three typical types (AllSight, 2016):
- Structured data: Relational data that can be stored and analyzed in an RDBMS. They include POS data, email, CRM data, financial data, loyalty card data, and help desk tickets.
- Unstructured data: This type of data is the most difficult to deal with. It is generated from GPS devices, blogs, mobile data, PDF files, web log data, forums, website content, spreadsheets, photos, clickstream data, RSS feeds, word processing documents, satellite images, videos, audio files, RFID tags, social media data, XML data, call center transcripts, etc.
- Semi-structured data: Data in XML or JSON format.
Figure 4: Typical data types of Big Data. Source: Adapted from AllSight, 2016.
3. Benefits of Big Data
Big Data are emerging as one of the most important technologies in the data-driven market and carry some typical benefits:
a. Marketing agencies monitor data and information in social networks such as Facebook and Twitter to learn about the response to their promotions and campaigns.
b. Product companies and retail organizations can plan production by using information such as customers' preferences and product perception.
c. Hospitals and insurance agencies can provide better, higher-quality services based on data from patients' previous medical records.
d. Big Data technologies provide more accurate analysis, which may lead to more concrete decision-making, cost reductions, operational efficiency, and reduced risk.
4. Big Data technologies
Given the benefits of Big Data listed above, scholars and data scientists from Tutorials Point (2016) distinguish two categories of Big Data technologies:
a. Operational Big Data
Operational Big Data technology includes systems such as MongoDB and other NoSQL systems that provide operational capabilities for real-time, interactive workloads where data are primarily captured and stored. These systems often take advantage of cloud computing and serve a role similar to an RDBMS.
b. Analytical Big Data
Analytical Big Data technology comprises systems such as Massively Parallel Processing (MPP) databases and MapReduce that provide analytical and statistical capabilities for complex analysis.
These two classes of technology are complementary and are frequently deployed together. The table below shows a comparison between the two classes.
Table: Comparison between Operational and Analytical Systems. Source: Adapted from Tutorials Point, 2016.
5. Big Data Challenges
Big Data pose major challenges to many organizations:
a. Capturing data is difficult.
b. Curation is not easy.
c. Storage requires huge amounts of memory and disk.
d. Sharing data is complicated.
e. Transferring data takes a long time because of the huge size.
f. Analysis of data requires advanced analytical tools.
g. Presentation of results is sophisticated.
To tackle these challenges, organizations often use enterprise servers in large-scale configurations.
D. Part III: Handling Big Data
Big Data means massive volumes of data in many forms. Traditional tools such as RDBMSs are often out of their depth, and the huge size of the data files becomes a major obstacle to analysis and statistics in many organizations.
1. Manageable size of data
Even though Hadoop can perform analytical processing with accurate results, it still has limitations in run time (Stucchio, 2013). For a data file of less than 100 MB, it is better to use a traditional RDBMS and Microsoft tools such as Excel or Access for queries and analysis. A file of hundreds of megabytes is not Big Data but is too big for Excel; users can analyze such fairly large data with Pandas (built on top of NumPy), Matlab, or R. For data files of 100 GB, 500 GB, or 1 TB, users can buy a 2-terabyte hard drive (about $94.99) or a 4-terabyte drive (about $169.99) and install Postgres. However, with a data file of around 5 terabytes, users may have to use Hadoop, because the alternatives such as big servers or many hard drives are considerably more expensive.
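For the middle case, too big for Excel but far from Hadoop territory, a chunked read in Pandas is usually enough. The sketch below is a minimal example; the file name transactions.csv and the column names category and amount are hypothetical.

```python
import pandas as pd

# Aggregate a few-hundred-MB CSV without loading it all into memory:
# read it in 1-million-row chunks and accumulate per-category totals.
totals = {}
for chunk in pd.read_csv("transactions.csv", chunksize=1_000_000):
    grouped = chunk.groupby("category")["amount"].sum()
    for category, amount in grouped.items():
        totals[category] = totals.get(category, 0.0) + amount

for category, amount in sorted(totals.items()):
    print(f"{category}\t{amount:.2f}")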
2. When to use Hadoop?
According to Burns (2013), companies use Hadoop for large, distributed data tasks when time is not a constraint, for example, running overnight reports to review daily transactions or scanning historical data going back several months. For real-time analytics, Hadoop is slow because it is optimized for batch jobs that scan every file in a data set, which limits its value in online environments that require fast responses and critical performance. A NoSQL database such as MongoDB is preferred for fast, responsive results in that situation.
3. Dealing with Big Data
a. Large data sets
Hadoop's HDFS handles applications with large datasets, typically gigabytes to terabytes in size. It provides high aggregate data bandwidth, scales to hundreds of nodes in a single cluster, and can support tens of millions of files in a single instance (Borthakur, 2012).
b. Streaming data access
Applications that run on Hadoop access their data sets in a streaming fashion; the emphasis is on high throughput rather than low latency of individual reads. Data files are intended for batch processing rather than interactive use.
c. Simple coherency model
Data files follow a write-once-read-many access model: once a file is created, written, and closed, it is not changed.
d. Moving data is more costly than moving computation
A computation over a huge data file is much more efficient when it executes near the data it operates on, so it is better to move the computation closer to where the data reside than to move the data.
e. Data integrity
Data are stored reliably in HDFS even when a name node or data node fails or the network partitions. A checksum mechanism in HDFS detects corrupted data chunks on the data nodes.
f. Data organization
- Data chunks: The data file is divided into data chunks, or data blocks, in HDFS, which supports write-once-read-many semantics on files. A typical block size used by HDFS is 64 MB or 128 MB, and each chunk resides on a data node (see the sketch after this list).
- Staging: When a file is created, the name node inserts the file name into the file system namespace and allocates a data chunk for it, along with other bookkeeping operations.
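As a rough illustration of how a file maps onto blocks, the short sketch below computes how many chunks a file of a given size occupies; the 128 MB block size and the example file size are assumptions used only for the arithmetic.

```python
import math

BLOCK_SIZE_MB = 128          # assumed HDFS block size
file_size_mb = 1_000         # hypothetical input file of about 1 GB

# HDFS stores a file as ceil(size / block_size) blocks; the last block
# may be smaller than the configured block size.
num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
last_block_mb = file_size_mb - (num_blocks - 1) * BLOCK_SIZE_MB

print(f"{file_size_mb} MB file -> {num_blocks} blocks "
      f"({num_blocks - 1} full blocks + one {last_block_mb} MB block)")
```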
g. Data accessibility
Applications can access data in HDFS in many ways: through a Java API, a C language wrapper, or an HTTP interface. Massive data sets should sit on an underlying storage and infrastructure platform that can handle the capacity and speed required by Big Data initiatives, particularly for mission-critical applications. Database servers should be built on solid-state storage, and the storage infrastructure for Big Data should support automated tiering, deduplication, compression, encryption, erasure coding, and thin provisioning. For faster networking, 10 Gb or 100 Gb Ethernet bandwidth should be used on the platform.
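As one concrete illustration of the HTTP access path, the sketch below lists a directory and reads a file through the WebHDFS REST interface using Python's requests library. The host name, the port 9870, and the /user/ts paths are placeholders, and the name node's HTTP port depends on the Hadoop version and configuration.

```python
import requests

NAMENODE = "http://namenode.example.com:9870"   # hypothetical name node address

# List a directory: WebHDFS serves metadata operations from the name node.
resp = requests.get(f"{NAMENODE}/webhdfs/v1/user/ts?op=LISTSTATUS")
resp.raise_for_status()
for entry in resp.json()["FileStatuses"]["FileStatus"]:
    print(entry["pathSuffix"], entry["length"], entry["type"])

# Read a file: OPEN redirects to a data node holding the data, so we let
# requests follow the redirect and stream the content back.
resp = requests.get(
    f"{NAMENODE}/webhdfs/v1/user/ts/report.txt?op=OPEN",
    allow_redirects=True,
)
resp.raise_for_status()
print(resp.text[:200])   # show the first 200 characters
```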
4. Preparation of Big Data
In order to be processed properly in Hadoop, Big Data must be prepared in the right format: they should be cleansed or filtered into appropriate forms before being fed into Hadoop for processing. According to Wayner (2013), the Hadoop ecosystem described briefly in Part I provides many software tools that can be used for preparing and handling the data. Some of them are described as follows:
a. Ambari
Ambari is a GUI wizard for setting up clusters of nodes and managing a cluster of Hadoop tasks.
b. HDFS
The HDFS is described in detail in Part I, Item 3, above.
c. HBase
HBase handles data in big tables, similar to Google's BigTable, and stores those tables so that they can be shared among multiple nodes.
d. Hive
Hive regularizes the process of extracting bits and snippets from the files stored in HBase, offering an SQL-like way to run query tasks.
e. Sqoop
Sqoop is a command-line tool that moves large tables from an RDBMS into tools such as HBase or Hive.
f. Pig
Pig processes data across the cluster in parallel using its own language, Pig Latin.
g. ZooKeeper
ZooKeeper maintains metadata in a file-system-like hierarchy that the other nodes use for task synchronization.
h. NoSQL
Many Hadoop deployments store and retrieve data in NoSQL databases such as MongoDB, Cassandra, or Riak.
i. Mahout
Mahout provides implementations of algorithms for filtering, data analysis, and classification that run on Hadoop clusters.
j. Lucene/Solr
Lucene handles indexing of large blocks of unstructured text in Hadoop, while Solr adds the ability to parse formats such as XML.
k. Avro
Avro is a serialization system that wraps the data together with its schema; the schema is typically expressed in JSON, which makes the data easy to parse.
l. Oozie
Oozie chains together the smaller tasks that a job has been broken into, managing the workflow so they run in the proper order.
m. GIS tools
GIS (Geographic Information System) tools are used to handle geographic data such as maps and three-dimensional imagery.
n. Flume
Flume gathers data and information and stores them in the HDFS, where they are ready for analysis.
o. SQL on Hadoop
Tools such as HAWQ, Impala, and Drill allow ad-hoc queries over data in Hadoop using simple SQL.
p. Clouds
Cloud platforms let users rent machines to run Hadoop jobs, crunching big data sets in the shortest time.
q. Spark
Spark processes data much like Hadoop MapReduce but at far higher speed, because it keeps data cached in memory.
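To show the contrast with the streaming word count above, here is a minimal PySpark sketch of the same job; it assumes pyspark is installed, and the input path hdfs:///user/ts/input.txt is hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
lines = spark.sparkContext.textFile("hdfs:///user/ts/input.txt")

# cache() keeps the intermediate data in memory, which is where Spark's
# speed advantage over disk-based MapReduce comes from when data are reused.
words = lines.flatMap(lambda line: line.split()).cache()

counts = (words.map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))

for word, count in counts.take(10):
    print(word, count)

spark.stop()
```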
E. Summary
This Unit 3 Individual Project presented an in-depth report on Big Data. It was divided into three sections:
- Part I described Hadoop, its basic components, and its architecture.
- Part II examined the various characteristics of Big Data.
- Part III discussed how to handle Big Data and prepare them in appropriate formats before Hadoop processes them, including manageable data sizes, streaming data access, data integrity, data staging, data preparation, etc.
In summary, Big Data are raw data, like crude oil, that need to be refined before they can be used in applications for Hadoop's analytical processing.
REFERENCES
Apache Software Foundation (2014). What is Apache Hadoop? Retrieved November 08, 2015 from http://hadoop.apache.org/
AllSight (2016). Extending the Hadoop architecture. Retrieved August 07, 2016 from http://www.allsight.com/hadoop-architecture/?gclid=Cj0KEQjwuJu9BRDP_-HN9eXs1_UBEiQAlfW39pml8hZvAr6kW2zirnq6SRAiHx47kbyh6BGsk0zUY48aAlUF8P8HAQ
Borthakur, D. (2012). HDFS architecture. Retrieved August 08, 2016 from https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#Streaming+Data+Access
Burns, E. (2013). Handling the hoopla: When to use Hadoop and when not to. Retrieved August 08, 2016 from http://searchbusinessanalytics.techtarget.com/feature/Handling-the-hoopla-When-to-use-Hadoop-and-when-not-to
Gartner Group (2013). Gartner predicts business intelligence and analytics will remain a top focus for CIOs through 2017. Press release. Las Vegas, NV. Retrieved June 4, 2015 from http://www.gartner.com/newsroom/id/2637615
Intel (2015). Extract, transform, and load big data with Apache Hadoop. Retrieved August 9, 2016 from https://software.intel.com/sites/default/files/article/402274/etl-big-data-with-hadoop.pdf
Machlis, S. (2011). 22 free tools for data visualization and analysis. ComputerWorld. Retrieved August 8, 2016 from http://www.computerworld.com/article/2507728/enterprise-applications/enterprise-applications-22-free-tools-for-data-visualization-and-analysis.html
Natarajan, R. (2012). Apache Hadoop fundamentals – HDFS and MapReduce explained with a diagram. Retrieved November 07, 2015 from http://www.thegeekstuff.com/2012/01/hadoop-hdfs-mapreduce-intro/
Stucchio, C. (2013). Don't use Hadoop – your data isn't that big. Retrieved August 07, 2016 from https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html
TutorialsPoint (2016). Hadoop – big data overview. Retrieved August 8, 2016 from http://www.tutorialspoint.com/hadoop/hadoop_big_data_overview.htm
Wayner, P. (2013). 18 essential Hadoop tools for crunching big data. Retrieved August 8, 2016 from http://www.infoworld.com/article/2606340/hadoop/131105-18-essential-Hadoop-tools-for-crunching-big-data.html