Unit 3 Individual Project
Handling Big Data for Hadoop
ThienSi Le
Colorado Technical University
CS882-1603C-01
Professor: Dr. Elizabeth Peralez
13-August-2016
Handling Big Data for Hadoop
A. Introduction
With new data computing, automation, and Web technologies in the competitive, data-driven market and Internet-based economy, data have become cheap to store, fast to process, and ubiquitous in both the public and private sectors. Big Data, a generic term for data characterized by the 5 V's (massive Volume, Variety of forms, high Velocity of processing, Veracity, and Value), poses major challenges to many organizations trying to extract or transform it into insightful information (Gartner, 2013). Handling Big Data with a traditional approach such as centralized database servers or RDBMSs (Relational Database Management Systems) with Microsoft Excel, Access, etc. no longer works. To analyze Big Data in colossal volumes, many organizations such as Amazon, Apple, Google, IBM, Intel, and Microsoft develop their own high-tech statistical tools or use advanced analytical tools. Some typical analytical tools are AWS (Amazon Web Services), Tableau, R, Apache Spark, VIDI, Quantum GIS, Google Fusion Tables, etc. (Machlis, 2011).
Big Data Analytics (BDA) is a systematic process of evaluating large amounts of data of varying types in order to identify patterns, relationships, unknown correlations, and other useful information. Today, BDA is applied in many fields: healthcare, automotive, presidential campaigns, highway traffic, insurance, banking, social networking, law enforcement, etc. (Natarajan, 2012). One of the analytical tools in popular demand for Big Data Analytics is Hadoop. This report consists of three parts that highlight the Hadoop ecosystem, discuss data in the context of Big Data, and describe how data are prepared for the Hadoop system:
- Part I briefly describes the Hadoop ecosystem.
- Part II examines Big Data, Big Data technologies, challenges, benefits, etc.
- Part III discusses handling Big Data, including manageable data sizes, when to use Hadoop, and how Big Data can be prepared before Hadoop processes them.
B. Part I: Hadoop Ecosystem
The Hadoop project is an Apache open source framework, written in Java, that allows parallel processing of large datasets such as Big Data across scalable, distributed clusters of servers and computers, providing both distributed storage and computation (Apache Software Foundation, 2014). Apache Hadoop has emerged as the de facto standard for managing large volumes of unstructured data (Intel, 2015). The Hadoop framework consists of four primary components:
1. Hadoop Common: This module contains the Java libraries and utilities, such as the filesystem and OS-level abstractions, that support the other Hadoop components.
2. Hadoop YARN: This module handles job scheduling and cluster resource management.
3. Hadoop Distributed File System (HDFS): The HDFS is a distributed file system that provides high-throughput access to application data. Refer to Figure 1 for more information.
4. Hadoop MapReduce: This YARN-based module performs parallel processing of large data sets.
Figure 1: A Simplified Hadoop Architecture. Source: Adapted from Apache Software Foundation, 2012.
Hadoop addresses the problem of huge amounts of data with the MapReduce algorithm, which Google originally developed. Hadoop breaks a big file or large data set into many data chunks and distributes them to multiple servers, the data nodes in the HDFS. Each data chunk typically stores 128 MB of data. The top cluster node, or name node, manages the file system metadata, while the data nodes are inexpensive commodity servers that store the data chunks, as shown in Figure 2. When a client executes a query, it obtains the file metadata from the name node and then reads the actual data blocks from the multiple data nodes. Hadoop provides a command-line interface for administrators to work on HDFS, and the name node runs a built-in Web server that lets users browse the HDFS file system and view basic statistics. The results from the individual data nodes are collected to form the final result dataset (Apache Software Foundation, 2014).
Figure 2: HDFS simplified block diagram. Source: Adapted from Borthakur (Apache Hadoop Organization, 2012).
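To make the MapReduce idea concrete, the sketch below shows a minimal word-count job written for Hadoop Streaming in Python. It is an illustration of the split-map-shuffle-reduce flow under stated assumptions, not a production job; the script name and the way it is invoked are placeholders.

```python
#!/usr/bin/env python3
# wordcount_streaming.py - one script usable as both the mapper and the
# reducer of a Hadoop Streaming word-count job.
#   mapper : emits "word<TAB>1" for every word read from stdin
#   reducer: sums the counts per word (input arrives sorted by key)
import sys


def mapper():
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")


def reducer():
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").rsplit("\t", 1)
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, 0
        current_count += int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")


if __name__ == "__main__":
    # "map" or "reduce" is chosen when the streaming job launches the script.
    mapper() if sys.argv[1] == "map" else reducer()
```

A job like this is typically submitted with the hadoop-streaming jar, passing the script as the -mapper and -reducer commands; the exact jar path and options depend on the installation.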
Figure 3 shows how regular data are processed in a traditional data warehouse alongside the processing of Big Data for data science in an ETL process. Source: Adapted from Intel, 2015.
CRM: Customer Relationship Management
ERP: Enterprise Resource Planning
ETL: Extract, Transform, and Load process
Sqoop, Flume, ODBC/JDBC: data ingestion and connectivity tools for Hadoop
C. Part II: Big Data
With the advent of new technologies, devices, sensors, social networks, communications, etc., the amount of data generated by people and organizations grows rapidly every day. Up to 2003, roughly 5 billion gigabytes had been generated in total; by 2011 the same amount of data was generated every two days, and by 2013 every ten minutes (TutorialsPoint, 2016). Big Data refers to such collections of massive datasets, which can only be processed with advanced analytical and statistical tools.
1. Sources of data
Big Data are generated by many different devices and applications:
- Black box data: Black boxes on airplanes, helicopters, jets, etc. record the voices of the flight crew and audio from microphones and earphones.
- Social media data: Social media networks such as Facebook, Twitter, LinkedIn, etc. hold streaming data, online conversations, and posts from millions of users.
- Stock exchange data: Stock exchange data hold information on share trades in stock markets.
- Power grid data: Power grid data hold information from consumers and power stations.
- Transport data: Transport data include information about a vehicle, such as its model, capacity, etc.
- Search engine data: Search engines such as Google, Yahoo, etc. retrieve large amounts of data and information from various databases.
2. Data types
Big Data can be characterized into three typical types (AllSight, 2016):
- Structured data: Relational data that can be stored and analyzed in an RDBMS. They include POS data, email, CRM data, financial data, loyalty card data, and help desk tickets.
- Unstructured data: This type of data is the most difficult to deal with. It is generated from GPS devices, blogs, mobile data, PDF files, web log data, forums, website content, spreadsheets, photos, clickstream data, RSS feeds, word processing documents, satellite images, videos, audio files, RFID tags, social media data, XML data, call center transcripts, etc.
- Semi-structured data: Data in XML or JSON format.
Figure 4: Typical data types of Big Data. Source: Adapted from AllSight, 2016.
3. Benefits of Big Data
Big Data are emerging as one of the most important technologies in the data-driven market and carry some typical benefits:
a. Marketing agencies monitor data and information in social networks such as Facebook and Twitter to learn about the response to their promotions and campaigns.
b. Product companies and retail organizations can plan production by using information such as customers' preferences and product perception.
c. Hospitals and insurance agencies can provide better, higher-quality services based on data from patients' previous medical records.
d. Big Data technologies provide more accurate analysis, which may lead to more concrete decision-making, cost reductions, operational efficiency, and reduced risk.
4. Big Data technologies
Given the benefits of Big Data listed above, scholars and data scientists from Tutorials Point (2016) distinguish two categories of Big Data technologies:
a. Operational Big Data
Operational Big Data technology includes systems such as MongoDB and other NoSQL systems that provide operational capabilities for real-time, interactive workloads where data are primarily captured and stored. These systems often take advantage of cloud computing and serve a role similar to an RDBMS.
b. Analytical Big Data
Analytical Big Data technology comprises systems such as Massively Parallel Processing (MPP) databases and MapReduce that provide analytical and statistical capabilities for complex analysis.
These two classes of technology are complementary and are frequently deployed together. The table below shows a comparison between the two classes.
Table: Comparison between Operational and Analytical Systems. Source: Adapted from Tutorials Point, 2016.
5. Big Data Challenges
Big Data pose major challenges to many organizations:
a. Capturing data is difficult.
b. Curation is not easy.
c. Storage requires huge amounts of memory and disk.
d. Sharing data is complicated.
e. Transferring data takes a long time because of the huge size.
f. Analysis of data requires advanced analytical tools.
g. Presentation of results is sophisticated.
To tackle these challenges, organizations often use enterprise servers in large-scale configurations.
D. Part III: Handling Big Data
Big Data means massive volumes of data in many forms. Traditional tools such as RDBMSs are often out of their depth, and the huge size of the data files becomes a major obstacle to analysis and statistics in many organizations.
1. Manageable size of data
Even though Hadoop can perform analytical processing with accurate results, it still has limitations in run time (Stucchio, 2013). For a data file of less than 100 MB, it is better to use a traditional RDBMS and Microsoft tools such as Excel or Access for queries and analysis. A file of hundreds of megabytes is not Big Data but is too big for Excel; users can analyze such fairly large data with Pandas (built on top of NumPy), Matlab, or R. For data files of 100 GB, 500 GB, or 1 TB, users can buy a 2-terabyte hard drive (about $94.99) or a 4-terabyte drive (about $169.99) and install Postgres. However, with a data file of around 5 terabytes, users may have to use Hadoop, because the alternatives such as big servers or many hard drives are considerably more expensive.
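For the middle case, too big for Excel but far from Hadoop territory, a chunked read in Pandas is usually enough. The sketch below is a minimal example; the file name transactions.csv and the column names category and amount are hypothetical.

```python
import pandas as pd

# Aggregate a few-hundred-MB CSV without loading it all into memory:
# read it in 1-million-row chunks and accumulate per-category totals.
totals = {}
for chunk in pd.read_csv("transactions.csv", chunksize=1_000_000):
    grouped = chunk.groupby("category")["amount"].sum()
    for category, amount in grouped.items():
        totals[category] = totals.get(category, 0.0) + amount

for category, amount in sorted(totals.items()):
    print(f"{category}\t{amount:.2f}")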
2. When to use Hadoop?
According to Burns (2013), companies use Hadoop for large, distributed data tasks when time is not a constraint, for example, running overnight reports to review daily transactions or scanning historical data going back several months. For real-time analytics, Hadoop is slow because it is optimized for batch jobs that scan every file in a data set, which limits its value in online environments that require fast responses and critical performance. A NoSQL database such as MongoDB is preferred for fast, responsive results in that situation.
3. Dealing with Big Data
a. Large data sets
Hadoop's HDFS handles applications with large datasets, typically gigabytes to terabytes in size. It provides high aggregate data bandwidth, scales to hundreds of nodes in a single cluster, and can support tens of millions of files in a single instance (Borthakur, 2012).
b. Streaming data access
Applications that run on Hadoop access their data sets in a streaming fashion; the emphasis is on high throughput rather than low latency of individual reads. Data files are intended for batch processing rather than interactive use.
c. Simple coherency model
Data files follow a write-once-read-many access model: once a file is created, written, and closed, it is not changed.
d. Moving data is more costly than moving computation
A computation over a huge data file is much more efficient when it executes near the data it operates on, so it is better to move the computation closer to where the data reside than to move the data.
e. Data integrity
Data are stored reliably in HDFS even when a name node or data node fails or the network partitions. A checksum mechanism in HDFS detects corrupted data chunks on the data nodes.
f. Data organization
- Data chunks: The data file is divided into data chunks, or data blocks, in HDFS, which supports write-once-read-many semantics on files. A typical block size used by HDFS is 64 MB or 128 MB, and each chunk resides on a data node (see the sketch after this list).
- Staging: When a file is created, the name node inserts the file name into the file system namespace and allocates a data chunk for it, along with other bookkeeping operations.
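As a rough illustration of how a file maps onto blocks, the short sketch below computes how many chunks a file of a given size occupies; the 128 MB block size and the example file size are assumptions used only for the arithmetic.

```python
import math

BLOCK_SIZE_MB = 128          # assumed HDFS block size
file_size_mb = 1_000         # hypothetical input file of about 1 GB

# HDFS stores a file as ceil(size / block_size) blocks; the last block
# may be smaller than the configured block size.
num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
last_block_mb = file_size_mb - (num_blocks - 1) * BLOCK_SIZE_MB

print(f"{file_size_mb} MB file -> {num_blocks} blocks "
      f"({num_blocks - 1} full blocks + one {last_block_mb} MB block)")
```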
g. Data accessibility
Applications can access data in HDFS in many ways: through a Java API, a C language wrapper, or an HTTP interface. Massive data sets should sit on an underlying storage and infrastructure platform that can handle the capacity and speed required by Big Data initiatives, particularly for mission-critical applications. Database servers should be built on solid-state storage, and the storage infrastructure for Big Data should support automated tiering, deduplication, compression, encryption, erasure coding, and thin provisioning. For faster networking, 10 Gb or 100 Gb Ethernet bandwidth should be used on the platform.
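As one concrete illustration of the HTTP access path, the sketch below lists a directory and reads a file through the WebHDFS REST interface using Python's requests library. The host name, the port 9870, and the /user/ts paths are placeholders, and the name node's HTTP port depends on the Hadoop version and configuration.

```python
import requests

NAMENODE = "http://namenode.example.com:9870"   # hypothetical name node address

# List a directory: WebHDFS serves metadata operations from the name node.
resp = requests.get(f"{NAMENODE}/webhdfs/v1/user/ts?op=LISTSTATUS")
resp.raise_for_status()
for entry in resp.json()["FileStatuses"]["FileStatus"]:
    print(entry["pathSuffix"], entry["length"], entry["type"])

# Read a file: OPEN redirects to a data node holding the data, so we let
# requests follow the redirect and stream the content back.
resp = requests.get(
    f"{NAMENODE}/webhdfs/v1/user/ts/report.txt?op=OPEN",
    allow_redirects=True,
)
resp.raise_for_status()
print(resp.text[:200])   # show the first 200 characters
```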
4. Preparation of Big Data
In order to be processed properly in Hadoop, Big Data must be prepared in the right format: they should be cleansed or filtered into appropriate forms before being fed into Hadoop for processing. According to Wayner (2013), the Hadoop ecosystem described briefly in Part I provides many software tools that can be used for preparing and handling the data. Some of them are described as follows:
a. Ambari
Ambari is a GUI wizard for setting up clusters of nodes and managing a cluster of Hadoop tasks.
b. HDFS
The HDFS is described in detail in Part I, Item 3, above.
c. HBase
HBase handles data in big tables, similar to Google's BigTable, and stores those tables so that they can be shared among multiple nodes.
d. Hive
Hive regularizes the process of extracting bits and snippets from the files stored in HBase, offering an SQL-like way to run query tasks.
e. Sqoop
Sqoop is a command-line tool that moves large tables from an RDBMS into tools such as HBase or Hive.
f. Pig
Pig processes data across the cluster in parallel using its own language, Pig Latin.
g. ZooKeeper
ZooKeeper maintains metadata in a file-system-like hierarchy that the other nodes use for task synchronization.
h. NoSQL
Many Hadoop deployments store and retrieve data in NoSQL databases such as MongoDB, Cassandra, or Riak.
i. Mahout
Mahout provides implementations of algorithms for filtering, data analysis, and classification that run on Hadoop clusters.
j. Lucene/Solr
Lucene handles indexing of large blocks of unstructured text in Hadoop, while Solr adds the ability to parse formats such as XML.
k. Avro
Avro is a serialization system that wraps the data together with its schema; the schema is typically expressed in JSON, which makes the data easy to parse.
l. Oozie
Oozie chains together the smaller tasks that a job has been broken into, managing the workflow so they run in the proper order.
m. GIS tools
GIS (Geographic Information System) tools are used to handle geographic data such as maps and three-dimensional imagery.
n. Flume
Flume gathers data and information and stores them in the HDFS, where they are ready for analysis.
o. SQL on Hadoop
Tools such as HAWQ, Impala, and Drill allow ad-hoc queries over data in Hadoop using simple SQL.
p. Clouds
Cloud platforms let users rent machines to run Hadoop jobs, crunching big data sets in the shortest time.
q. Spark
Spark processes data much like Hadoop MapReduce but at far higher speed, because it keeps data cached in memory.
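To show the contrast with the streaming word count above, here is a minimal PySpark sketch of the same job; it assumes pyspark is installed, and the input path hdfs:///user/ts/input.txt is hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
lines = spark.sparkContext.textFile("hdfs:///user/ts/input.txt")

# cache() keeps the intermediate data in memory, which is where Spark's
# speed advantage over disk-based MapReduce comes from when data are reused.
words = lines.flatMap(lambda line: line.split()).cache()

counts = (words.map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))

for word, count in counts.take(10):
    print(word, count)

spark.stop()
```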
E. Summary
This Unit 3 Individual Project presented an in-depth report on Big Data. It was divided into three sections:
- Part I described Hadoop, its basic components, and its architecture.
- Part II examined the various characteristics of Big Data.
- Part III discussed how to handle Big Data and prepare them in appropriate formats before Hadoop processes them, including manageable data sizes, streaming data access, data integrity, data staging, data preparation, etc.
In summary, Big Data are raw data, like crude oil, that need to be refined before they can be used in applications for Hadoop's analytical processing.
REFERENCES
Apache Software Foundation (2014). What is Apache Hadoop? Retrieved November 08, 2015 from http://hadoop.apache.org/
AllSight (2016). Extending the Hadoop architecture. Retrieved August 07, 2016 from http://www.allsight.com/hadoop-architecture/?gclid=Cj0KEQjwuJu9BRDP_-HN9eXs1_UBEiQAlfW39pml8hZvAr6kW2zirnq6SRAiHx47kbyh6BGsk0zUY48aAlUF8P8HAQ
Borthakur, D. (2012). HDFS architecture. Retrieved August 08, 2016 from https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#Streaming+Data+Access
Burns, E. (2013). Handling the hoopla: When to use Hadoop and when not to. Retrieved August 08, 2016 from http://searchbusinessanalytics.techtarget.com/feature/Handling-the-hoopla-When-to-use-Hadoop-and-when-not-to
Gartner Group (2013). Gartner predicts business intelligence and analytics will remain a top focus for CIOs through 2017. Press release. Las Vegas, NV. Retrieved June 4, 2015 from http://www.gartner.com/newsroom/id/2637615
Intel (2015). Extract, transform, and load big data with Apache Hadoop. Retrieved August 9, 2016 from https://software.intel.com/sites/default/files/article/402274/etl-big-data-with-hadoop.pdf
Machlis, S. (2011). 22 free tools for data visualization and analysis. ComputerWorld. Retrieved August 8, 2016 from http://www.computerworld.com/article/2507728/enterprise-applications/enterprise-applications-22-free-tools-for-data-visualization-and-analysis.html
Natarajan, R. (2012). Apache Hadoop fundamentals – HDFS and MapReduce explained with a diagram. Retrieved November 07, 2015 from http://www.thegeekstuff.com/2012/01/hadoop-hdfs-mapreduce-intro/
Stucchio, C. (2013). Don't use Hadoop – your data isn't that big. Retrieved August 07, 2016 from https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html
TutorialsPoint (2016). Hadoop – big data overview. Retrieved August 8, 2016 from http://www.tutorialspoint.com/hadoop/hadoop_big_data_overview.htm
Wayner, P. (2013). 18 essential Hadoop tools for crunching big data. Retrieved August 8, 2016 from http://www.infoworld.com/article/2606340/hadoop/131105-18-essential-Hadoop-tools-for-crunching-big-data.html