Literature Review: Transforming Big Data into Quantitative Insights
Chapter One provided an overview of the research study, including the background information and framework. It established that Big Data, characterized by the five V's, is an organizational asset that may contain hidden patterns, correlations among variables, and frequently occurring, predictable events in healthcare. Mining Big Data remains a tremendous challenge for many organizations, and the impact of big data analytics on Big Data is still little known in the academic community and in industry.
Chapter Two explores Big Data, particularly its five dimensions (volume, variety, velocity, veracity, and value), as a new frontier in business and as a socio-technical phenomenon. The complexity of huge healthcare data sets is addressed. The chapter examines big data analytics and up-to-date analytical tools in light of Big Data's challenges and data access issues in both the public and private sectors. The outcomes of data analysis are insights that may drive a new science of big data and the evolution of wisdom. A standardized big data analytics software tool is suggested for analyzing Big Data.
For this in-depth review of the literature, many documents, including academic and practitioner articles and journals from various accessible sources, were collected and examined. Other documents, such as agendas, attendance registers, minutes of meetings, manuals, background papers, books and brochures, diaries and journals, event programs, letters and memoranda, maps and charts, newspapers, press releases, survey data, and various public records, were also reviewed (Bowen, 2008). Advanced analytical and statistical tools in Big Data Analytics, used to collect, analyze, and extract Big Data for meaningful information, holistic knowledge, and professional wisdom, are also evaluated for making practical and strategic decisions or gaining a competitive edge in the dynamic local and global market (Imanual, 2015; Lurie, 2014). This literature review focuses on three primary components: Big Data, Big Data Analytics, and Insights. These components are discussed in depth in relation to the research topic of retrieving insightful information from big data, particularly healthcare data, for pattern correlation or the frequent, predictable occurrence of events in the evolution of DIKW (Data, Information, Knowledge, and Wisdom) (Ahlemeyer-Stubbe & Coleman, 2014). The chapter provides a critical review of the theoretical and contextual literature and a graphic of the conceptual framework of the proposed dissertation.
Big Data
With new data computing, automation, and Web technologies in the competitive data-driven market and Internet-based economy, data with low storage cost and fast processing have exploded, becoming ubiquitous and plentiful in both the public and private sectors (Chen, Chiang, & Storey, 2012; Richards & King, 2014). Big Data, a generic term for such data, poses major challenges for many organizations in extracting or transforming complex data into insightful information (Gartner, 2013). Big Data is a new paradigm that combines five characteristic dimensions: volume, velocity, variety, veracity, and value (Goes, 2014; Jacobs, 2009):
- Volume: One of the primary characteristics of Big Data is its massive volume. The size of a data set can range from terabytes to petabytes or even zettabytes. Storing such enormous amounts of data is a real problem for many organizations, particularly mid-size and small companies, and exploring and understanding big data is a technical issue for many users (Economist, 2010). For example, Walker (2015), a marketing executive at Vouchercloud, estimated that 100 gigabytes (GB) of data were generated per second in 2002, 28,875 GB per second in 2013, and 2.5 quintillion (2,500,000,000,000,000,000) bytes every day in 2015.
- Variety: Big data usually comes in various forms and multiple formats. It can be categorized into structured, semi-structured, unstructured, and metadata types for control and processing purposes. Big Data is generated by humans, by different devices, and by applications (Dell Midmarket Research, 2013). For instance, data is produced by sources such as flight black boxes, social media, stock exchanges, power grids, transport systems, and search engines. Data containing the voices of flight crew members, captured by microphones and earphones, are kept safely in the black boxes of airplanes, helicopters, and jets. Data also include streaming data, online conversations, and users' posts from millions of people in social media networks like Facebook, Twitter, and LinkedIn. Data may carry information on share trades in stock exchanges such as the Dow Jones, Nasdaq, or S&P 500. Power grid data hold information from consumers and power stations. Transport data describe a vehicle's model, capacity, weight, specifications, and so on. Large volumes of data and information from various databases are generated by search engines like Google, Yahoo, IEEE, and ACM.
- Velocity: Velocity refers to the speed at which data are generated and must be processed. Recently, many analytical tools have become available for retrieving meaningful information from big data at high speed. For instance, Jones (2014) listed the top 10 data analysis tools for business, including Tableau Public, OpenRefine, RapidMiner, and KNIME. Machlis (2011) discussed 22 free analytics tools, including data visualization tools for analyses and presentations, and Lurie (2014) summarized 39 analytics tools for big data visualization in cloud computing. These advanced tools can process large data sets and support cyclic data flow with in-memory computation at ultra-fast speed. Apache Spark's execution engine can run programs up to 100 times faster than Hadoop MapReduce in memory, or about 10 times faster on disk (a brief illustrative sketch of this in-memory approach appears after this list). At a machine learning conference (PAPIs.io, 2016), Steenbergen (2016) presented the possibilities of distributed deep learning, including image analysis, image generation, and, most famously, learning to play Go, by distributing training and computation over a Spark cluster. Spark takes users to the next level of big data processing and innovates data science. Amazon Web Services' EC2 offers GPU instances that users can spin up on demand for about a dollar per spot instance, roughly two to three times cheaper than other instances. Deep learning frameworks such as Berkeley's Caffe, or Flickr's variant of it, can run on an existing cluster alongside other Spark jobs, and Spark allows users to train multiple models at once and even leverage existing deep learning models.
- Veracity: Veracity is the truthfulness of the information extracted from big data. The real meaning of information is important for managers and business strategists who are responsible for making concrete business decisions that could lead their companies to success or failure; their decisions and vision are extremely important to those companies (IBM, 2011). An IBM case study at the USC (University of Southern California) Annenberg Innovation Lab illustrates the search for veracity in Big Data analytics. The Lab wants to uncover insights buried in millions of daily online conversations and streaming data. It uses IBM Analytics tools to capture, collect, and analyze these massive data in various forms, such as tweets and Facebook posts across different fields, to identify trends in near-real time. The Lab applies sentiment analytics, social media analytics, and predictive analytics to demonstrate the impact of a TV ad within a day of airing, to show the sentiment of debate viewers in real time, and, it expects, to give countries early notice of potential health crises or civil unrest (Smith, 2013).
- Value: The value of Big Data lies in the real-time usefulness of the data under use. Usefulness is similar to veracity in decision making (Wright, 2014): managers evaluate the worthiness or importance of data when considering a responsive decision (Snider, 2012). Data in motion are spontaneous values created on the fly; they turn an event into insight at the moment, and most of their value is the moment of truth created at that point in time. USC scholars used IBM analytics tools in a sentiment analytics project to collect, capture, and analyze massive data in various forms for insights; the solution at the USC Annenberg Innovation Lab successfully applied big data analytics in near-real time by analyzing millions of social media conversations (Smith, 2013). An Illinois healthcare agency created a comprehensive enterprise data warehouse and an EMPI (enterprise master patient index) to display a full view of each patient's record. The EMPI provides insightful information from medical records collected from multiple sources across multiple agencies, such as hospitals, health insurers, pharmacies, labs, and drugstores. A new analytics platform and WebFOCUS business intelligence (BI) tools created by Information Builders provide analytical queries and analysis reports that allow users such as clinicians and administrators to monitor critical metrics for performance management. The EMC® Greenplum® Data Computing Appliance (DCA) and Greenplum™ Unified Analytics Platform (UAP) support an enterprise-wide view of the patient data. By implementing the EMPI and DCA, the Illinois healthcare provider can increase healthcare quality and reduce costs by tackling the siloed nature of data among clinical departments, hospital systems, labs, and clinical applications.
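As noted under the Velocity dimension above, much of Spark's speed advantage over disk-based MapReduce comes from keeping intermediate data in memory. The following is a minimal, illustrative PySpark sketch, not drawn from the cited studies; the file path and column names are hypothetical. It shows how a data set can be cached in memory so that repeated analytical passes avoid re-reading from disk:

```python
from pyspark.sql import SparkSession

# Start a local Spark session (a real cluster would use a cluster manager).
spark = SparkSession.builder.appName("InMemoryExample").getOrCreate()

# Hypothetical healthcare events file; path and columns are illustrative only.
events = spark.read.csv("patient_events.csv", header=True, inferSchema=True)

# cache() keeps the data set in memory after the first action,
# so later queries reuse it instead of re-reading from disk.
events.cache()

# First pass: count events per patient.
events.groupBy("patient_id").count().show()

# Second pass over the same cached data: average length of stay per hospital.
events.groupBy("hospital_id").avg("length_of_stay").show()

spark.stop()
```

The design point is simply that the second query reuses the cached data rather than touching storage again, which is the behavior behind the in-memory speed claims cited above.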
Data can be categorized into quantitative and qualitative data, both of which can hold valuable information content. Quantitative data can take different forms such as nominal, ordinal, binary, scale, and metric data, while qualitative data come from surveys, interviews, online questionnaires, and the like (Brown, Chui, & Manyika, 2011). From a real-world perspective, data is a fundamental representation of facts, observed by humans without context. Ahlemeyer-Stubbe and Coleman (2014) defined data as facts and figures pertinent to the customer, consumer behavior, marketing, and sales activities. Data becomes an essential element of products, storing information about the relationships among systems, sources, and so on, and it is managed in a centralized environment. Data collected and stored in firms' IT database management systems includes two primary types: (1) internal data and (2) external data. Internal data is generated by the organization's own processes for handling daily business, and its quality and reliability are under the organization's control (Mayer-Schönberger & Cukier, 2013); for example, product specifications or invoices are internal data. On the other hand, external data, generated outside the organization's own processes, often contains discrepancies and is used as additional data or as reference values such as credit ratings (Huck, 2015). Some huge data sets are noisy. Big noisy data is big data with corrupted electronic signals, errors introduced in processing steps, or unstructured content that machines cannot interpret; noisy data is meaningless data. Hardware failures, programming errors, or gibberish input from speech recognition or optical character recognition (OCR) programs can generate noisy data, which increases storage space requirements and distorts the results of data mining analysis. Data analysts and data scientists can use statistical analysis to filter out noisy data.
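As one hedged illustration of the statistical filtering mentioned above, and not a method prescribed by the cited sources, a simple interquartile-range rule can flag numeric readings that fall implausibly far outside the bulk of the data; the column name, sample values, and threshold below are assumptions for the sketch:

```python
import pandas as pd

# Hypothetical vital-sign readings; real data would come from devices or records.
readings = pd.DataFrame({"heart_rate": [72, 75, 70, 68, 500, 74, -3, 71]})

# Interquartile-range (IQR) rule: values far outside the middle 50% are treated as noise.
q1 = readings["heart_rate"].quantile(0.25)
q3 = readings["heart_rate"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

clean = readings[readings["heart_rate"].between(lower, upper)]
print(clean)  # implausible readings such as 500 or -3 are dropped as noise
```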
AllSight (2016) characterizes Big Data into three typical types: (1) structured data, (2) unstructured data, and (3) semi-structured data. Structured data is relational data that can be stored and analyzed in an RDBMS; it includes POS data, email, CRM data, financial data, loyalty card data, and help desk tickets. Unstructured data are the most difficult to deal with; they are generated from GPS devices, blogs, mobile data, PDF files, web log data, forums, website content, spreadsheets, photos, clickstream data, RSS feeds, word processing documents, satellite images, videos, audio files, RFID tags, social media data, XML data, and call center transcripts. Semi-structured data are data formatted in forms such as XML or JSON.
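To make the semi-structured category concrete, the short sketch below is an illustration only, not drawn from AllSight, and the record fields are hypothetical. It parses a JSON patient record whose structure is self-describing but not fixed to a relational schema:

```python
import json

# A hypothetical semi-structured record: tagged fields, but no rigid table schema.
record = '''
{
  "patient_id": "P-1001",
  "visits": [
    {"date": "2016-03-02", "diagnosis": "flu"},
    {"date": "2016-05-17", "diagnosis": "follow-up", "notes": "recovered"}
  ]
}
'''

patient = json.loads(record)
for visit in patient["visits"]:
    # Optional fields (like "notes") may or may not be present in each visit.
    print(visit["date"], visit["diagnosis"], visit.get("notes", ""))
```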
In the last five years, Big Data has emerged as a contemporary frontier for innovations in information technology (Gartner Group, 2013). It offers new opportunities in the evolution of DIKW, particularly revolutionary information for both the public and private sectors. In many organizations, data and information are used interchangeably with only a vague distinction, particularly in computer science (CS) and economics: CS scientists view information as coded data, while economists consider information additional knowledge not stored in the data system, so data is equated with the information it presents (McNurlin, Sprague, & Bui, 2009). Recently, scholars have categorized Big Data into "data at rest" and "data in motion" (Ebberg, 2013). Data at rest, or traditional data, is static or inactive data containing values that are collected and stored in servers, computers, or databases to be analyzed later for decision making; it includes files, backup tapes, tables, patient records, and so on. On the other hand, data in motion (also called data in transit or data in use) is dynamic data processed on the fly, in real time, in the network or in cloud servers without being stored on the hosts. Data in motion may flow over a public or untrusted network such as the Internet, or within a confined private network such as an intranet, corporate LAN (Local Area Network), or WAN (Wide Area Network) (Moore, 2014). Data in motion are spontaneous values created on the fly; they can turn an event into insight at the moment the event occurs, and most of their value is the moment of truth created at that point in time (Nixon, 2013). For example, the Annenberg Innovation Lab at the University of Southern California used IBM analytics tools, such as IBM InfoSphere Streams and IBM BigSheets, to uncover the insightful feelings (like or dislike, agree or disagree) of the target audience in real time during the 2008 presidential debates. The data comprise emails, the web, and Internet protocols. Processing data in motion instantaneously requires advanced analytical tools such as IBM InfoSphere Streams, Tableau, Hadoop, or Apache Spark (Wayteck, 2011).
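As a hedged illustration of processing data in motion, the sketch below uses Spark Structured Streaming to analyze records as they arrive rather than storing them first. It is a sketch only: the socket source and the word-count logic are assumptions for demonstration, not the Lab's actual pipeline, and a production deployment would more likely read from a message broker.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("DataInMotion").getOrCreate()

# Read a live text stream (here, a local socket; a real deployment might use Kafka).
lines = spark.readStream.format("socket") \
    .option("host", "localhost").option("port", 9999).load()

# Count words on the fly, e.g., mentions of candidates or sentiment keywords.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Continuously print updated counts without persisting the raw stream.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```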
The enormous mountain of data, with its hidden treasures, generated by different devices, sensors, and applications forces many organizations to focus on how to control, extract, transfer, and load Big Data (Carter, 2014). Big Data carries some typical benefits (AllSight, 2016). For example, marketing agencies and banking systems monitor data and information on social networks such as Facebook and Twitter to learn about responses to their promotions and campaigns (Capgemini, 2013). Product companies and retail organizations can plan production using information such as customer preferences and product perception (IBM Analytics, 2015). Hospitals and insurance agencies can provide better, higher-quality services based on data from patients' previous medical records (Nolan, 2015). Big Data technologies provide more accurate analyses that may lead to more concrete decision making, cost reductions, operational efficiencies, and reduced risk. Most companies search for a better path to concrete decision making and a successful data-driven strategy in the competitive market, while academic and scientific communities seek to understand the business economy in depth. Industry leaders, academic scholars, and data scientists all hold high expectations that Big Data will propel society, with more wealth and prosperity, into a new frontier of innovations (Abhishek & Arvind, 2007).
Boyd and Crawford (2012) projected the rise of Big Data as an organizational development that employs interactions between humans and technology. Big Data becomes a socio-technical phenomenon because it involves a scheme of arrangement and a process of complex work design that employs the interaction between humans and technology in the workplace (Long, 2013). A socio-technical system refers to the interaction between complex infrastructures and human behaviors; it concerns joint optimization, that is, the interrelatedness of the social and technical aspects of an organization or of society as a whole (Trist & Bamforth, 1951).
Jacobs (2009) studied the pathologies of Big Data and found that it was difficult for humans to analyze stored unstructured data in very large spreadsheets and to extract data from traditional database management systems. Using a pathology approach to Big Data by examining a sample of significant data did not work. Jacobs' study showed how difficult it is to handle Big Data because of the row-by-column limitations of existing spreadsheets. Some differences between conventional data and Big Data are summarized in Table 2.1 below:
Table 2.1 The differences between Conventional Data and Big Data.
Source: Synthesized and built for this paper by this student researcher (2016).
Scholars raise debatable questions, critically interrogating assumptions about and biases in the socio-technical phenomenon of Big Data (Mayer-Schönberger & Cukier, 2013). Six provocations to ignite conversations about Big Data concern its phenomenon, culture, scholarship, technology, analysis, and mythology. Many scholars and leading scientists have called for an international conference to discuss and learn about Big Data amid extensive utopian and dystopian bombast (Thomson, 2010). Since Big Data is a socio-technical phenomenon, it is worthy of a robust research study.
To narrow the scope of Big Data in the proposed research dissertation, healthcare data was studied across three Internet-based, participatory, cloud, and mobile domains: (1) personal health information (PeHI), (2) clinical health information (CHI), and (3) public health information (PuHI) (Shneiderman, Plaisant, & Hesse, 2013). Statistics show that unstructured data makes up about 70% of an organization's data assets (AllSight, 2016).
Personal health information consists of the records that healthcare providers and patients collect about their own health habits and practices. Monitoring the human body with sophisticated sensors enables physicians and nurses to understand the pros and cons of treatments (HIPAA Act, 1996). Based on personal health information such as patient medical activities, clinical health information such as electronic health record systems, and public health information such as public health data, researchers focus on healthcare data and data analytics, which hold the promise of improving the quality of healthcare delivery and the potential to enhance patient care, save lives, and lower treatment costs. There are many advantages of applying big data to healthcare in clinical operations, research and development, public health, evidence-based medicine, genomic analytics, pre-adjudication fraud analysis, patient profile analytics, and more (Raghupathi & Raghupathi, 2014).
Clinical health information comes from electronic health record systems that improve patient care and provide valuable insights into treatment patterns. With outcomes from data visualization, hospitals and universities continue to improve nursing and physician training programs. However, training physicians in what they should know is increasingly difficult because the large body of knowledge on specialized cases, various medications, and professional guidelines changes rapidly as new results emerge (Quora, 2014).
Public health information is the large amount of public health data collected by bodies such as the US National Center for Health Statistics, the Centers for Disease Control, the Census Bureau, and the World Health Organization that assists policy makers in making more reliable decisions. However, using this health information to derive insights remains a challenge (Jagadish et al., 2014).
Digging into healthcare data, Raghupathi and Raghupathi (2014) found that applying BDA to Big Data in healthcare could have a significant impact on various fields of health care. The positive outcomes could include detecting diseases at earlier stages; managing individual and population health efficiently; detecting healthcare fraud more quickly; estimating from large amounts of historical data such outcomes as length of stay, whether to choose elective surgery, and which patients would not benefit from surgery; identifying patients at risk of medical complications or of advancing disease stages; and pinpointing patients who are the greatest consumers of health resources. Further promising results include identifying causal factors of illness progression; providing patients with information for making informed decisions and managing their own health; tracking healthier behaviors; identifying treatments and the lifestyle factors that increase the risk of adverse events in order to reduce re-admissions; improving outcomes by examining vitals from at-home health monitors; and managing population health by detecting vulnerabilities within the patient population during disease outbreaks.
For Big Data in healthcare, disease treatment and healthcare services are progressing, but at a slow pace. They do not keep up with the exponential spread of diseases and illness, especially among the elderly. Many diseases, such as AIDS, Alzheimer's disease, and various cancers, still have no cure. The gap in modern treatment, efficient cures, and effective prevention persists in healthcare services and health institutions.
Big Data Analytics
Today, society changes constantly, especially in technologies such as Big Data Analytics (BDA). BDA is a systematic process of evaluating large amounts of data of varying types in order to identify hidden patterns, relationships among variables, unknown correlations, market trends, customer preferences, and other useful information such as diagnoses of illness or detection of fraud (Taylor, 2015). Data analytics is the process of transforming data into insightful information, and many advanced approaches, rigorous techniques, strong models, and infrastructures are employed to retrieve the desired information. Recently, BDA has become an emergent trend in popular demand in many fields: education, manufacturing, marketing, politics, healthcare, security, defense, and insurance. Demand for BDA provides plentiful employment opportunities in many organizations for big data talent with strong analytical skills (Sondergaard, 2015). However, organizations' abilities to extract information still encounter limitations (Snijders, Matzat, & Reips, 2012).
Given the benefits of using Big Data Analytics to mine Big Data for insights, scholars and data scientists from Tutorials Point (2016) distinguish Big Data Analytics technologies into two categories: (1) operational BDA and (2) analytical BDA. Operational BDA technology includes NoSQL systems such as Amazon DynamoDB, Cassandra, InfiniteGraph, and MongoDB that provide operational capabilities for real-time, interactive workloads; data are mostly captured and stored for cloud computing in a manner similar to an RDBMS (relational database management system). On the other hand, analytical BDA technology comprises systems such as Massively Parallel Processing (MPP) database systems, MapReduce, the Hadoop ecosystem, and Apache Spark, which have the analytical and statistical capabilities for extremely complex analyses. These two classes of BDA technology are complementary and are frequently deployed together, enhancing each other. Table 2.2 below compares the two classes of Big Data technology.
Table 2.2 A characteristic comparison between Operational and Analytical Systems.
Source: Adapted from Tutorials Point, 2016.
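To illustrate the contrast in a hedged way, the sketch below pairs an operational store with an analytical engine. It is an illustration only: the collection name, fields, and file path are hypothetical, and it assumes the pymongo and pyspark packages are installed along with running MongoDB and Spark environments.

```python
from pymongo import MongoClient
from pyspark.sql import SparkSession

# Operational side: a NoSQL store (MongoDB) for fast, record-level interaction.
client = MongoClient("mongodb://localhost:27017")
visits = client["hospital"]["visits"]
visits.insert_one({"patient_id": "P-1001", "diagnosis": "flu"})
print(visits.find_one({"patient_id": "P-1001"}))

# Analytical side: Spark scans the full data set for complex aggregate analysis.
spark = SparkSession.builder.appName("AnalyticalBDA").getOrCreate()
history = spark.read.json("visit_history.json")  # hypothetical export of past visits
history.groupBy("diagnosis").count().orderBy("count", ascending=False).show()
spark.stop()
```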
Handling Big Data with a traditional approach, such as centralized database server systems or an RDBMS (relational database management system) with Microsoft Excel or Access, no longer works (Connolly & Begg, 2014). To analyze Big Data of colossal volume and varied forms, many organizations such as Amazon, Apple, Google, IBM, Intel, and Microsoft have developed their own high-tech statistical tools or used advanced analytical tools (Brandon, 2015). Some typical analytical tools are AWS (Amazon Web Services), Tableau, R, and Apache Spark (Machlis, 2011). In business, the range and strategic impact of BDA on Big Data are vast; applications of BDA appear in many fields, including healthcare, automotive, presidential campaigns, highway traffic, insurance, banking, social networking, and law enforcement (Natarajan, 2012).
BDA can be used with Internet search engines (e.g., Google or Yahoo) and social media networks (e.g., LinkedIn, Facebook, Twitter) to collect, capture, and analyze online conversations, streaming data, and users' posts in order to learn about human behavior and actions on their homepages. BDA can detect fraud and spam, improve website design, and explore advertising opportunities (Clark, 2015). Beyond business, modern astronomical telescopes, genome sequencers, and physics particle accelerators also generate vast amounts of data for BDA.
One example of BDA is Google Fusion Tables, a Web-based data management service used to gather, visualize, and share data tables. Data are captured and stored in multiple tables for viewing or download, and the service provides dataset visualization and mapping. Its platform is a Web browser such as Chrome (Halevy & Shapley, 2009). Data can be displayed visually in different forms such as bar charts, pie diagrams, line plots, timelines, scatter plots, or geographical maps, and the data can be exported in comma-separated values (CSV) format. It has a skill level of 1 for users with basic spreadsheet knowledge, and it is free and easy to use. Another example of a BDA application is QlikView, which offers simple drag-and-drop, self-service creation of data visualizations without writing many SQL query commands. QlikView can connect various databases from different vendors into its centralized repository, and it has an intelligent indexing method that discovers patterns and trends across different data types. QlikView provides dashboards to aid decision support systems; its platform uses 64-bit Windows, with a skill level of 2 (Qlik, 2015). QlikView accepts dynamic data formats from any source into its in-memory analytics platform and offers many documentation channels for building big data applications quickly, without disruption or downtime. IBM Watson, in turn, is a question-answering computing system for machine learning, information retrieval, knowledge representation, and automated reasoning. It can find the correct answer after running a hundred algorithms of proven language analysis. IBM Watson's applications are often used in financial services, telecommunications, healthcare, and government, as well as in game contests such as Jeopardy (Thomson, 2010). Users are not required to know statistics because IBM Watson does the computing in the background, and it provides browser-based visualization and analysis applications with a skill level of 1. IBM Watson is an analytics tool able to retrieve key information from documents and to surface hidden patterns, insights, and correlations across vast data sets; about 80% of data are unstructured, in forms such as news articles, online posts, research papers, or organizational system data (Thomson, 2010). It is also a free tool.
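As a hedged, generic illustration of the export-and-visualize workflow these tools support, the sketch below uses pandas and matplotlib rather than the tools themselves; the CSV file name and column names are hypothetical assumptions:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical CSV exported from a data table: one row per diagnosis with a count.
df = pd.read_csv("diagnosis_counts.csv")  # assumed columns: diagnosis, count

# Display the data visually as a simple bar chart.
df.plot(kind="bar", x="diagnosis", y="count", legend=False)
plt.ylabel("Number of cases")
plt.title("Cases per diagnosis")
plt.tight_layout()
plt.show()
```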
According to Herodotou, Lim, Luo, Borisov, Dong, Cetin, and Babu (2011), data scientists and leading scholars expect a primary breakthrough in such data to come from distributed and grid computing. As a result, many disciplines, such as engineering and applied science, have computing sub-branches in biology, economics, and even journalism. Data analytics, the transformation of data into insights, has gained popular demand as a new trend in corporate strategy in many organizations. Today, Big Data Analytics is a new corporate trend and a key to success in business (Herher, 2014).
Big Data poses enormous challenges. Many organizations, along with academic scholars and data scientists, encounter a great many barriers and difficulties in retrieving insights from Big Data, and these difficulties are widely discussed and explored. The massive volume of Big Data cannot be stored properly in traditional database systems such as RDBMSs (relational database management systems) (Sadalage & Fowler, 2012). Unstructured data are generated from GPS devices, blogs, mobile data, PDF files, web log data, forums, website content, spreadsheets, photos, clickstream data, RSS feeds, word processing documents, satellite images, videos, and audio files (AllSight, 2016). Big Data analytics performed at lower cost and at the right time becomes a major factor of success in industry. The hard science disciplines have worked on volume, and perhaps velocity (Brown et al., 2011); however, studying all five V's together is an exciting and sophisticated challenge. Based on the data, meaningful information, logical knowledge, and wisdom extracted from big data, researchers can build a theoretical framework of wisdom (Minelli, Chambers, & Dhiraj, 2013). Organizing and converting unstructured data into categories is an enigma and a headache for data scientists. Some typical challenges in Big Data retrieval are that (1) capturing data is difficult because of its massive size, (2) curation is not easy, (3) storage requires huge memory and disk capacity, (4) sharing data is complicated because it comes in various forms, (5) transferring data is time-consuming because of its huge volume, (6) analysis requires advanced analytical tools, and (7) presentation of results is sophisticated and requires data visualization tools (Microsoft Power BI, 2016).
With the data explosion projected to reach 44 zettabytes by 2020 (Vizard, 2014), organizations have no choice: they must use BDA to maximize computing power and accurate algorithms, on the prevalent belief that big data offers a profound form of intelligence and knowledge that can produce insights for a competitive edge.
Hadoop MapReduce and Apache Spark are two popular analytics software tools used by many companies, and they can complement each other or work together; for example, Spark can run on the Hadoop Distributed File System (HDFS). Spark applications can run up to 100 times faster than those run on Hadoop MapReduce because Spark uses RAM (in-memory processing) while Hadoop works from the hard disk. Hadoop MapReduce has more flexible, wide-ranging options, but Spark can convert a big chunk of data into actionable information faster. Trovit, a classified ads engine, uses HDFS with many smaller servers to solve the storage problem posed by huge amounts of data. However, when using Hadoop MapReduce on HDFS, developers and users experience an inflexible application programming interface (API) and strictly disk-bound activity. Apache Spark offers a flexible and fast distributed processing framework, and developers can run MapReduce-style code in production on the Spark platform with high speed and ease of use. For instance, the Trovit team built, with Spark, a set of libraries on top of the framework for rapid processing suited to their resources. Today, Trovit uses the Hadoop and Spark combination for renewed flexibility in the language of the data and for parallel processing that is effective and efficient (Riggins, 2016). In the rise of big data analytics, Microsoft acquired Revolution Analytics, IBM bought SPSS, North Carolina State University founded SAS, and Bell Labs created the S language. The R language is the next-generation successor to S. R has a set of general libraries for various applications such as econometrics and natural language processing, and, much as with XML, some R libraries target industry issues and problems in clinical trials, genomics, genetics, insurance, education, finance, manufacturing, and healthcare. R supports desktop and server-based processing, and it can perform parallelized processing within Hadoop clusters, like Apache Spark, for data mining and data warehouses. R has also expanded into other complex fields such as biology and genomics within statistics (Schmidt, 2014). Many organizations, such as Google, YouTube, the FDA, The New York Times, and Facebook, have used R for graphical and statistical computing on big data sets. Contributors in a fast-growing community promote R applications in Big Data processing and development in many organizations, and R's rapid expansion appears to be overtaking the market recently controlled by SAS and SPSS (Datacamp, 2014).
Many advanced approaches, rigorous techniques, strong models, and infrastructures are employed to retrieve the desired information; however, the ability to extract information still encounters limitations (Snijders, Matzat, & Reips, 2012). Notably, to tackle these challenges, major organizations often use enterprise servers in large-scale configurations at high cost.
With the rise of cloud, distributed, and grid computing techniques, data scientists and information professionals play key roles in the assessment of Big Data. They all know that Big Data contains enormous algorithmic information, and the devil is in it (Herher, 2014). From medicine, security, and education to politics, organizations that can use Big Data for theories, scenarios, and assessment of past assumptions will gain a competitive edge, and Big Data can extend the goals of life, liberty, and happiness (Lazari, 2016). However, the keys to the chest of data are in the hands of private companies and of holdings such as the NSA's secret stores, which always protect their assets. Users do not have authorization to access the data repositories, and they cannot claim ownership of their own data (Resnik, 2011). People have no idea of the depth of data kept secretly by the government (Voosen, 2015). Accessing Big Data for statistical analysis or analytics in the open environment is still a challenge for information professionals (Laurila, Gatica-Perez, Aad, Bornet, Do, Dousse, & Miettinen, 2012).
Insights
Insights are the results or outcomes of data analytics work on big data. They are useful information that is meaningful and valuable to organizations for many business purposes, such as assisting managers (1) to make sound and precise business decisions, (2) to improve business performance, (3) to increase organizational productivity, and (4) to gain and sustain a competitive edge in the dynamic local and global market.
In general, decision making has traditionally been a participatory process in which several participants (about 5 to 10 persons) collect information, analyze problems or situations, weigh courses of action, and select the best solution for a problem across a wide range of business situations. The process used to arrive at decisions can be structured or unstructured. Time pressure and conflicting goals, which are often external contingencies, affect the development and effectiveness of decision-making groups. Group decision methods include (a) the Delphi technique, built by the RAND Corporation about six decades ago (RAND Corporation, 1950), (b) dialectical inquiry, (c) brainstorming, and (d) the nominal group technique. In the 1970s, Sprague and Carlson (1982) developed Decision Support Systems (DSSs) to help decision makers confront ill-structured problems. Other information-centric decision-making systems are Executive Information Systems (EISs), Expert Systems (ESs), agent-based modeling, and real-time CRM (Customer Relationship Management) (McNurlin, Sprague, & Bui, 2009). Business Intelligence (BI), which also facilitates corporate decision making, consists of data mining, data warehousing, and OLAP (online analytical processing) (Connolly & Begg, 2014). Other, more indirect approaches to decision making are think tanks or reflection pools such as the Brookings Institution or the Heritage Foundation, traditional forecasting, and contemporary scenario planning (Daniel Research Group, 2011; Seemann, 2002; Wade, 2014). To date, decision-making technologies continue to evolve rapidly, from Big Data and Big Data Analytics to Artificial Intelligence and the IoT (Internet of Things). To gain a competitive edge in a dynamic, data-driven, Web-centric market, organizations use BDA to mine insights within Big Data for better decision making across a wide range of applications, from technology and economics to local and global concerns (Paulding, 2016).
From this intensive review of the literature, it appears that many scholarly authors and professional practitioners describe, identify, and discuss Big Data, a hot emerging trend in the data-centric world. However, apparently no one mentions a study of the science of Big Data, here called Big Data Science. Big Data Science could be a new educational branch alongside Computer Science, IS (Information Systems), or IT (Information Technology). Big Data Science is a discipline that seeks to build a scientific foundation for such topics as Big Data Analytics, Big Data software, data/information/knowledge/wisdom processing, algorithmic solutions to data-related problems, and the algorithmic process itself. The gap is that no framework or foundation for Big Data Science is available.
The topic of this research is the evolution of wisdom, which extracts big data (D) into information (I), transforms it into knowledge (K), and then constructs wisdom (W). The DIKW evolution will establish a DIKW model whose foundation supports Data Management (DM), Information Management (IM), Knowledge Management (KM), and Wisdom Management (WM) (Ahlemeyer-Stubbe & Coleman, 2014). DM and IM have already been developed; however, KM and WM are relatively new and are not addressed completely. These topics deal with knowledge and wisdom. Knowledge is information processed among individuals, from individuals to groups, or across groups, while Knowledge Management is the process of coordinating knowledge. Unlike information, knowledge is an understanding of customers or relationships, together with a notion of the underlying idea, acquired by study, investigation, observation, or experience rather than based on assumptions or opinions (Ahlemeyer-Stubbe & Coleman, 2014). If facts are about data and reporting is about information, then analytics is about knowledge. Knowledge is considered intellectual capital (IC), or at least part of IC, a valuable organizational asset that requires identifying, managing, sharing, and protecting for competitive advantage in the marketplace. The wisdom constructed from knowledge may be one of the most difficult subjects to deal with, because both knowledge and wisdom are associated with humans, Homo sapiens. Knowledge and wisdom extracted from Big Data are not yet thoroughly or clearly characterized, and it is not clear how to manage these intellectual assets effectively and efficiently in organizations. The gap is the need to determine and manage knowledge and wisdom effectively.
In the past five years, many Big Data Analytics software tools have become available along with the data explosion. Selecting the right BDA platform and appropriate software tool(s) becomes critical for any organization with regard to advanced technology, implementation, deployment, ease of use, maintenance, training, customer support, and cost (Cohen, Dolan, Dunlap, Hellerstein, & Welton, 2009). According to Minelli, Chambers, and Dhiraj (2013), big data analytics (BDA) is a scientific and systematic process for evaluating massive volumes of data in various forms at high speed; its objective is to identify specific patterns, insightful relationships, unknown correlations, and other meaningful information. Different BDA software tools are available in the industry: at least twenty-two tools for data visualization and analysis, such as Tableau, the R Project, QlikView, Hadoop MapReduce, and Apache Spark, are available free of charge to users (Machlis, 2011). The gap here is that no universal, standardized data analytics software tool for Big Data is available to professional users (Patrizio, 2014).
Graphical conceptual framework of Big Data retrieval
Today, high-tech society changes constantly and exponentially, especially in technology (Grant, 2016). Data analytics is the data process used to retrieve insightful information from Big Data. Many advanced approaches, rigorous techniques, strong models, and infrastructures are employed to retrieve the desired information (Jones, 2014); however, the ability to extract information still encounters biases and limitations (Snijders, Matzat, & Reips, 2012). Scholars, data scientists, and data analysts in many fields have identified further gaps of knowledge in using Big Data for insights, particularly in healthcare (Tutorials Point, 2016). For example, capturing, storing, and analyzing Big Data are more complex and more limited because of its colossal volume and wide variety of forms. Big Data includes structured, unstructured, and semi-structured types in many different forms such as texts, blogs, mobile data, Web log data, forums, audio and video files, and images, and controlling and maintaining these forms is often beyond traditional methods (Jagadish, Gehrke, Labrinidis, Papakonstantinou, Patel, Ramakrishnan, & Shahabi, 2014). Digital curation, that is, the preservation, collection, and maintenance of numerous forms of Big Data, usually encounters bottlenecks or system freezes. Sharing Big Data among organizations in interconnected networks is complicated because of different formats and platforms (Lohr, 2012). Transferring data among computer hosts and servers is very time-consuming, freezes systems, and is often interrupted. Performing analysis on Big Data requires advanced analytics tools and highly skilled professionals (Baroni, 2014). Also, presenting analytical results or outcomes is sophisticated for audiences who do not understand them because they lack the technical background (Few, 2016). The results and outcomes of an analysis usually require data visualization tools such as Tableau or QlikView for display to the audience (Pandre, 2016).
With the advent of fourth-generation languages and personal computers in the 1990s, and of new standard computing, automation, and Web technologies more recently, data with low storage and processing costs has become ample and ubiquitous (Cukier, 2015). With the data explosion and big data technologies in e-commerce, finance, insurance, healthcare, and other sectors, data ubiquity drives the evolution of data and information into knowledge and wisdom at ultra-fast speed (Erickson & Rothberg, 2014). Managing information in business leads to the modern field of Business Intelligence (Anonymous, n.d.), and sharing knowledge among individuals and across groups leads to new disciplines such as data science, knowledge management, and content management (Birasnav, Goel, & Rastogi, 2012; Nonaka & Takeuchi, 1995). Therefore, a study of extracting and transforming Big Data, such as values or numbers without context, into meaningful statistical information for predicting the occurrence of events (e.g., earthquakes, disasters, healthcare DNA decoding, flu crises) is well worth researching, because it can achieve four objectives: (1) making practical and strategic decisions, (2) improving business performance, (3) increasing organizational productivity, and (4) gaining and sustaining a competitive edge in the dynamic local and global market (Davis, 2016; Gartner, 2016). Particularly in healthcare, the study of retrieving insights from personal, clinical, and public health information will advance understanding of the behavior of genes, drugs, and proteins, which can then be used to design new medicines that benefit humans and animals (Goodfellow, Bengio, & Courville, 2016). Companies use Big Data Analytics to gain competitive advantages and to ensure their own survival. Some typical advantages of Big Data analyses in the evolution of DIKW are growing the business, improving operational efficiency, driving revenues, acquiring new customers, and winning more market share (Podesta, Pritzker, Moniz, Holdren, & Zients, 2014).
Based on the research topic, the in-depth literature review, and the identified gaps of knowledge, the conceptual framework for the proposed research dissertation is developed and described as follows.
The graphic of the conceptual framework is created as a linear model in a Microsoft PowerPoint (MS .pptx) document. It is based on a literature review of ninety-two (92) credible articles and many papers on the topics and subtopics. Each topic or subtopic is entered into a box, and the topics and subtopics are displayed in thirty (30) topic boxes. These topic boxes are arranged into seven logically ordered groups:
1. Group 1: This group explains the topic "What is Big Data?", which describes, discusses, and explores Big Data and its challenges, difficulties, and obstacles in collecting, storing, analyzing, and processing data (George, Haas, & Pentland, 2014).
2. Group 2: This group discusses the topic "Big Data and Research," covering the knowledge gap, trends, and research on Big Data. It highlights the capabilities of Big Data for gaining knowledge about the market, customers, and demand (Bughin, Chui, & Manyika, 2010).
3. Group 3: Group 3 addresses the topic "Benchmarks on Big Data," which establishes a standardization of Big Data, such as an underlying business benchmark, a data model, and a synthetic data generator focusing on the variety, velocity, and volume aspects of big data systems (Baru, Bhandarkar, Nambiar, Poess, & Rabl, 2012).
4. Group 4: Group 4 focuses on the topic "Applications of Big Data" in various fields and areas such as mobile computing, life science, instruments, genomics, healthcare, and government (Costa, 2012).
5. Group 5: Group 5 covers the topic "Security and Ethics of Big Data," which addresses codes of conduct, data security, information ownership, human privacy, human subjects, and related risks (CITI Program, 2015).
6. Group 6: Group 6 constructs the topic "Knowledge Management (KM) and its Applications," which establishes a framework, knowledge innovation, and KM applications. Data is intellectual capital valuable enough to be identified, managed, and protected, perhaps granting a competitive advantage in the marketplace (Tsai, 2013).
7. Group 7: This group provides a guide for novice researchers on research methodology (Ellis & Levy, 2009).
In the Microsoft .pptx graphic slides, a straight line is drawn from the origin at the lower left corner up to the upper right corner. All thirty topic boxes are arranged along this straight line into the seven groups of topics and subtopics used in drafting the literature review. The straight line represents the linear conceptual framework model, which consists of the seven topic groups. The graphic of the conceptual framework has four slides in sequential order:
a. The first slide displays topic groups 1, 2, 3, and 7 on the linear line.
b. The second slide displays topic groups 4, 5, and 6 on the extended line.
c. The third slide shows a staircase figure that includes the (1) general problem, (2) specific problem, (3) purpose, and (4) research question oriented toward a qualitative methodology (Ql) (Bryman & Bell, 2011).
d. The fourth slide shows a staircase figure similar to the third slide, but with the research question oriented toward a quantitative methodology (Qn) (Creswell, 2014).
Figure 2.1: A linear model shown in the graphic of the conceptual framework of the literature review on Big Data Analytics research.
Slide 1: The first four topic groups (1, 2, 3, and 7) focus on Big Data and its applications (Brown, Chui, & Manyika, 2011; Xian & Madhavan, 2014).
Slide 2: The remaining three topic groups (4, 5, and 6) address Big Data, security, ethics, knowledge, and knowledge management (Alavi & Leidner, 2001; Albescu, Pugna, & Paraschiv, 2009; Birasnav, Goel, & Rastogi, 2012).
Slide 3: A conceptual framework for a qualitative method that includes the general problem, the specific problem, the purpose of the research study, and the research question (Alasuutari, 2010).
For the Ql design method, the qualitative interview strategy will use one of four distinct types of Ql interviews: focus groups; online Internet interviews; casual conversations and in-passing clarifications; and semi-structured and unstructured interviews (Rubin & Rubin, 2011). Two additional interview strategies proposed by other scholars are in-depth interviews and projective technique interviews (Hargiss, 2015). These Ql interview strategies differ from one another in the interviewer's role, the interviewees' participation, and the relationship between the interviewer and the interviewees. Semi-structured and unstructured interviews can be categorized as in-depth interviews; both are extended conversations between an interviewer and an interviewee (Rubin & Rubin, 2011). In semi-structured interviews, the researcher has a specific topic to learn about in depth, plans questions in advance, and asks follow-up questions within a narrow scope. In unstructured interviews, the researcher has a general topic in mind, but specific questions may be generated as the interview proceeds, in response to what the interviewee says, within a generic scope; the purpose is similar to that of in-depth interviews. In an in-depth interview, the researcher, as a well-trained interviewer, conducts a face-to-face interview with a participant. A set of probing questions is provided to the interviewee, and the interviewer encourages the interviewee to express his or her point of view within a larger scope. The purpose is to collect as much memory, attitudinal, and behavioral data from the interviewee as possible (DiCicco-Bloom & Crabtree, 2006).
The target population consists of two groups. Group 1 includes graduate and doctoral students who study Big Data and concentrate on data analytics at CTU or other universities; its estimated size is ten (10) students. Group 2 consists of professionals and analysts who work with and handle Big Data generated in a variety of fields such as e-commerce and market intelligence, e-government and politics, science and technology, smart health and well-being, and security and public safety; its estimated size is another ten (10) participants. The total estimated number of participants who can present a variety of views and who are willing to talk to the interviewer is twenty (20).
This student researcher purposefully selects participants and sites as sources of data in the qualitative research. Qualitative data come from many sources, for example, field notes, existing documents, interviews, and audio and video tapes. The interview process is one of three stages (i.e., the interview process, the observational process, and the artifact review) in the qualitative strategies. Qualitative data collection strategies primarily include interview techniques and open-ended items on a survey.
Purposive sampling selects graduate and doctoral students from the academic community and professional analysts from business and industry because they can purposefully inform an understanding of the research problem and the central phenomenon. The purposive sampling plan is a non-probability sampling strategy in which participants are selected based on predetermined criteria such as their knowledge, understanding, and experience of the Big Data topic. The purposive sampling is also based on the participants' relevance to the research questions.
Slide 4: A conceptual framework for a quantitative method (DeVault, 2015).
In quantitative research, the Qn data collection strategy relies on survey techniques. A survey is performed on a population defined by the research objectives of the study; the population may be tangible or abstract. Statistical inference is made about a population based on data from a sample, where the sample is a representative subset of the population (Gall, Borg, & Gall, 2013). Samples are of two types: probability and non-probability. In probability sampling, each member of the tangible population must be known before sampling occurs, and each member must have an equal chance of being included in the sample. Non-probability sampling can be used with either a tangible or an abstract population.
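As a hedged illustration of simple random (probability) sampling, in which every known member has an equal chance of selection, the sketch below uses a hypothetical sampling frame and sample size chosen only for demonstration:

```python
import random

# Hypothetical sampling frame: every member of the tangible population is known.
population = [f"participant_{i:03d}" for i in range(1, 201)]

# Simple random sampling: each member has an equal chance of being selected.
random.seed(42)  # fixed seed so the sketch is reproducible
sample = random.sample(population, k=20)

print(sample)
```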
Data play a vital role in descriptive statistics. In data collection, particularly for Qn statistical analysis, data take the form of numbers. According to Field (2015), the secret of people's lives is hidden in numbers, and discovering or revealing that secret requires large-scale analysis. Researchers use data in the form of numbers or values to represent the people, organizations, or subjects participating in a research study. Researchers may get lost in the numbers; they may lose sight of the research study's objectives, goals, or purpose because they deal with data and information in the form of numbers or values every day, forgetting that the research objective is to contribute to the knowledge pool, to improve human life, or to provide more benefit to people, organizations, and the environment (Huck, 2015). For example, people who watch the weather forecast, football games, or the stock market all see numbers such as ambient temperatures, game scores, and stock values.
Note that the conceptual framework depicts both Ql and Qn methods, but the proposed research dissertation will use only one of them because of the heavy workload and time constraints of the third academic year at CTU.
In summary, the proposed research topic is "A research study of extracting and transforming Big Data, particularly huge healthcare data sets, as values without context into meaningful information such as hidden common pattern correlations and the frequent, predictable occurrence of events that benefit humans in the evolution of DIKW (Data, Information, Knowledge, and Wisdom)" (Ahlemeyer-Stubbe & Coleman, 2014). This chapter provided a review of the literature that includes an introduction to big data retrieval for insightful information to assist decision making in business. It reviewed the theoretical literature, consisting of recent and seminal works, and the contextual literature, comprising the most recent journals, credible articles, and scholarly periodicals, in three sections focusing on (1) what big data is, (2) the discipline of big data analytics, and (3) meaningful insights. It mined both the theoretical and contextual literature to find several gaps concerning a science of big data, the DIKW evolution, and a universal standardized big data analytics software tool. The gaps of knowledge in big data and data analytics include (1) a new discipline of big data science, (2) the DIKW evolution toward knowledge management and content management, (3) the enhancement of disease treatment and improvement of healthcare services, and (4) universal analytics tools for processing data effectively, including capturing, storing, and analyzing various large unstructured and semi-structured data sets, along with other issues such as data sharing, transfer, analysis, curation, and result presentation. It also presented the conceptual framework based on the topic as a linear model: a straight line with seven topic groups drawn from one hundred twenty-two (122) references, descending in a staircase from the general research problem to the specific problem, the research purpose, and especially the research questions, which can lead to the selection of either a qualitative or a quantitative methodology. Because the research study is very time-consuming and involves a heavy workload, only one methodology will be selected for conducting the research study at CTU (Colorado Technical University). Chapter Three will discuss the method design of the research study of Big Data Analytics' impact on Big Data.