Literature Review: Transforming Big Data into Quantitative Insights
Chapter One provides an overview of the research study, including the background information and framework. It establishes that Big Data with its five V's is an organizational asset that may contain hidden common patterns, correlations among variables, and frequent, predictable occurrences of events in healthcare. Mining Big Data is a tremendous challenge for many organizations, and the impact of big data analytics on Big Data is little known in the academic community and in industry.
In Chapter Two, the paper explores Big Data, particularly its five dimensions (volume, variety, velocity, veracity, and value), as a new frontier in business. Big Data has become a socio-technical phenomenon. The complexity of the huge data sets in healthcare is addressed. The paper examines big data analytics with up-to-date advanced analytical tools against Big Data's challenges and data access issues in both the public and private sectors. The outcomes of data analysis are insights that may drive a new science in big data: the evolution of wisdom. A standardized big data analytics software tool will be suggested for analyzing Big Data.
In an in-depth review of the literature, many documents such as academic and practitioner articles and journals from various accessible sources are collected and examined. Other documents such as agendas, attendance registers, minutes of meetings, manuals, background papers, books and brochures, diaries and journals, event programs, letters and memoranda, maps and charts, newspapers, press releases, survey data, and various public records are also verified (Bowen, 2008). Advanced analytical and statistical tools in Big Data Analytics used to collect, analyze, and extract Big Data for meaningful information, holistic knowledge, and professional wisdom are also evaluated for making practical and strategic decisions or gaining a competitive edge in the dynamic market, locally and globally (Imanual, 2015; Lurie, 2014). This literature review focuses on three primary components: Big Data, Big Data Analytics, and Insights. These components are discussed in depth in Chapter Two of the research dissertation, based upon the research topic of retrieving insightful information from big data, particularly data in healthcare, for pattern correlation or the frequent, predictable occurrence of events in the evolution of DIKW (Data, Information, Knowledge, and Wisdom; Ahlemeyer-Stubbe & Coleman, 2014). The chapter provides a critical review of the theoretical and contextual literature and a graphic of the conceptual framework of the proposed dissertation.
Big Data
With new data computing, automation, and Web technologies in the competitive data-driven market and Internet-based economy, data with low storage cost and fast processing have exploded and become ubiquitous and ample in both the public and private sectors (Chen, Chiang, & Storey, 2012; Richards & King, 2014). Big Data, a generic term for such data, poses major challenges of extracting or transforming complex data into insightful information for many organizations (Gartner, 2013). Big Data is a new paradigm that combines five characteristic dimensions: volume, velocity, variety, veracity, and value (Goes, 2014; Jacobs, 2009):
- Volume: One of the primary characteristics of Big Data is its massive volume. The size of a data set can range from terabytes to petabytes or zettabytes. Storing the enormous amounts of data becomes a real problem for many organizations, particularly mid-size or small companies, and exploring and understanding the big data is a technical issue for a lot of users (Economist, 2010). For example, Walker (2015), a marketing executive at Vouchercloud, estimated 100 gigabytes (GB) of data generated per second in 2002, 28,875 GB of data per second in 2013, and 2.5 quintillion (2,500,000,000,000,000,000) bytes of data generated every day in 2015.
- Variety: Big data usually comes in various forms and multiple formats. It can be categorized into structured, semi-structured, unstructured, and metadata types for control and processing purposes. Big Data are generated by humans, different devices, and applications (Dell Midmarket Research, 2013). For instance, data are generated by human operations from black boxes, social media, stock exchanges, power grids, transport systems, and search engines. Data containing the voices of the flight crew members, captured by microphones and earphones, are kept safely in the black boxes of airplanes, helicopters, and jets. Data also include streaming data, online conversations, and users' posts from millions of people in social media networks like Facebook, Twitter, and LinkedIn. Data may hold information on share trades in stock exchanges such as the Dow Jones, Nasdaq, or S&P 500. Power grid data hold information from consumers and power stations. Transport data describe a vehicle's model, capacity, weight, specifications, etc. A lot of data and information from various databases are generated by search engines like Google, Yahoo, IEEE, and ACM.
- Velocity: Recently, many analytical tools have become available for retrieving meaningful information from big data at speed. For instance, Jones (2014) listed the top 10 data analysis tools for business, including Tableau Public, OpenRefine, RapidMiner, and KNIME. Machlis (2011) discussed 22 free analytics tools, including data visualization tools for analyses and presentations. Lurie (2014) summarized 39 analytics tools for big data visualization in cloud computing. These advanced tools can usually execute large data sets and support cyclic data flow with in-memory computation at ultra-fast speed. Apache Spark's execution engine can run programs up to 100 times faster than Hadoop MapReduce in memory, or 10 times faster on disk. At a machine learning conference (PAPIs.io, 2016), Steenbergen (2016) presented the possibilities of distributed deep learning, such as image analysis, image generation, and, most famously, learning and playing Go, by distributing training and computation over a Spark cluster. Spark takes users to the next level of big data processing, innovating data science. Amazon Web Services' EC2 has GPU instances that users can spin up on demand for about a dollar per spot instance, roughly two to three times cheaper than other instances. Deep learning frameworks such as Berkeley's Caffe, adopted by companies like Flickr, can run on an existing cluster alongside other Spark jobs. Spark allows users to train multiple models at once and even leverage existing models in deep learning, as sketched below.
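The in-memory processing that underlies these speed claims can be illustrated briefly. The following is a minimal sketch, assuming Python with PySpark installed; the HDFS path and the "source" column are hypothetical.

```python
# A minimal sketch of Spark's in-memory processing, assuming PySpark;
# the HDFS path and "source" column are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("velocity-sketch").getOrCreate()

# Load a large event log and keep it cached in memory; this caching is
# the source of Spark's speed advantage over disk-based MapReduce on
# iterative workloads.
events = spark.read.json("hdfs:///data/events.json").cache()

# Repeated queries reuse the cached data instead of re-reading from disk.
events.groupBy("source").count().show()
print(events.filter(events.source == "twitter").count())

spark.stop()
```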
- Veracity: Veracity is the truthfulness of the information extracted from big data. The real meaning of information is important for managers and business strategists who have the responsibility to make concrete decisions that could lead their companies to success or failure; their decisions and vision are extremely important for the companies (IBM, 2011). A case study from IBM at the USC (University of Southern California) Annenberg Innovation Lab illustrates finding the veracity of Big Data through analytics. The Lab wants to uncover insights buried in the millions of daily online conversations and streaming data. It uses IBM Analytics tools to capture, collect, and analyze these massive data in various forms, such as tweets and Facebook posts in different fields, for trends in near-real time. The Lab applies sentiment analytics, social media analytics, and predictive analytics to demonstrate the impact of a TV ad within a day of airing, to show the sentiment of debate viewers in real time, and, it is hoped, to enable countries to receive early notice of potential health crises or civil unrest (Smith, 2013); a small sketch of this kind of sentiment scoring follows.
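As an illustration of sentiment analytics of the kind the USC Lab performed, the sketch below scores two invented tweets with NLTK's open-source VADER analyzer, which merely stands in for IBM's proprietary tools.

```python
# A minimal sentiment-analytics sketch; the tweets are invented and
# NLTK's VADER analyzer stands in for IBM's proprietary analytics.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download
analyzer = SentimentIntensityAnalyzer()

tweets = [
    "Loved the candidate's answer on healthcare tonight!",
    "That TV ad was misleading and annoying.",
]

for tweet in tweets:
    scores = analyzer.polarity_scores(tweet)  # neg/neu/pos/compound scores
    label = "positive" if scores["compound"] >= 0 else "negative"
    print(f"{label}: {tweet}")
```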
- Value: The value of Big Data is the real-time usefulness of the data under use. Usefulness is similar to veracity in decision making (Wright, 2014). Managers evaluate the value of data, its worthiness or importance, in considering a responsive decision (Snider, 2012). Data in motion are spontaneous values created on the fly; they turn an event into insight at the moment, and their greatest value is the moment of truth created at that point in time. USC scholars used IBM analytics tools in a sentiment analytics project to collect, capture, and analyze massive data in various forms for insights. The solution at the USC Annenberg Innovation Lab is a successful application of big data analytics in near-real time, analyzing millions of social media conversations (Smith, 2013). An Illinois healthcare agency created a comprehensive enterprise data warehouse, EMPI (enterprise master patient index), to display the patient's record view with full review capabilities. EMPI provides each patient's insightful information from medical records collected from multiple sources across multiple agencies such as hospitals, health insurers, pharmacies, labs, drugstores, etc. A new analytics platform and WebFOCUS business intelligence (BI) created by Information Builders provide analytical queries and analysis reports that allow users such as clinicians and administrators to monitor critical metrics for performance management purposes. The EMC® Greenplum® Data Computing Appliance (DCA) and Greenplum™ Unified Analytics Platform (UAP) aid an enterprise-wide view of its patient data. By implementing EMPI and DCA, the Illinois healthcare provider can increase healthcare quality and reduce costs by tackling the siloed nature of data among clinical departments, hospital systems, labs, and clinical applications.
Data can be categorized into quantitative and qualitative data, both of which can store valuable information content. Quantitative data can take different forms such as nominal, ordinal, binary, scale, and metric. Qualitative data comes from surveys, interviews, online questionnaires, etc. (Brown, Chui, & Manyika, 2011). From a real-world perspective, data is a fundamental representation of facts without context, captured by human observation. Ahlemeyer-Stubbe et al. (2014) defined data as facts and figures pertinent to the customer, consumer behavior, marketing, and sales activities. Data becomes an essential element of products that store information about the relationships among systems, sources, etc., and it is managed in a centralized environment. Data that has been collected and stored in firms' IT database management systems includes two primary types: (1) internal and (2) external data. Internal data of the organization is generated from the different processes that handle daily business; its quality and reliability are under the organization's control (Mayer-Schönberger, & Cukier, 2013). For example, data on product specifications or invoices is internal data. On the other hand, external data, which is generated outside the organization's own processes, often contains discrepancies and is used as additional data or as reference values, such as credit ratings (Huck, 2015). Some huge data sets are noisy. Big noisy data is defined as big data with corrupted electronic signals, errors introduced in some processing steps, or unstructured data that cannot be interpreted by machines. Noisy data is meaningless data. Hardware failures, programming errors, or gibberish input from speech recognition or optical character recognition (OCR) programs can generate noisy data. Noisy data increases storage space and affects the results of data mining analysis. Data analysts or data scientists can use statistical analysis to filter out noisy data, as sketched below.
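One simple statistical filter drops values that lie far from the median. The sketch below is a minimal illustration in Python; the sensor readings are invented, and the choice of five median absolute deviations as the cutoff is an assumption.

```python
# A minimal sketch of statistical noise filtering: values far from the
# median, measured in median absolute deviations (MAD), are treated as
# noise. The readings are invented for illustration.
import statistics

readings = [10.1, 9.8, 10.3, 9.9, 312.0, 10.0, -250.0, 10.2]

median = statistics.median(readings)
# MAD is robust: unlike the standard deviation, it is barely affected
# by the very outliers it is meant to detect.
mad = statistics.median(abs(x - median) for x in readings)

# Keep values within 5 MADs of the median; 312.0 and -250.0 are dropped.
clean = [x for x in readings if abs(x - median) <= 5 * mad]
print(clean)  # [10.1, 9.8, 10.3, 9.9, 10.0, 10.2]
```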
AllSight (2016) characterizes Big Data into three typical types: (1) structured data, (2) unstructured data, and (3) semi-structured data. Structured data is relational data that can be stored and analyzed in an RDBMS; it includes POS data, email, CRM data, financial data, loyalty card data, and help desk tickets. Unstructured data are the most difficult to deal with; they are generated from GPS, blogs, mobile data, PDF files, web log data, forums, website content, spreadsheets, photos, clickstream data, RSS feeds, word processing docs, satellite images, videos, audio files, RFID tags, social media data, XML data, and call center transcripts. Semi-structured data are data formatted in forms such as XML or JSON.
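The practical difference among these types can be sketched in a few lines. Below, a fixed tuple stands in for a structured RDBMS row and a JSON document for a semi-structured record; both records are invented for illustration.

```python
# A minimal sketch contrasting structured and semi-structured records.
# The structured row maps onto fixed RDBMS columns, while the JSON
# document is self-describing and carries an optional nested field
# that a fixed schema would not anticipate.
import json

# Structured: fixed columns, as in a relational POS table.
pos_row = ("2016-03-01", "store-42", "SKU-1001", 2, 19.98)

# Semi-structured: self-describing JSON with nested, optional fields.
doc = json.loads("""
{
  "date": "2016-03-01",
  "store": "store-42",
  "items": [{"sku": "SKU-1001", "qty": 2}],
  "customer": {"loyalty_id": "L-778"}
}
""")

print(pos_row[2])                           # access by fixed position/column
print(doc["items"][0]["sku"])               # access by key path
print(doc.get("coupon", "no coupon used"))  # optional fields need defaults
```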
In the last five years, Big Data has been emerging as a contemporary frontier for innovations in information technology (Gartner Group, 2013). It offers new opportunities in the evolution of DIKW, particularly revolutionary information for both the public and private sectors. In many organizations, data and information may be used interchangeably with a vague distinction, particularly in computer science (CS) and economics: CS scientists view information as coded data, while economists consider information additional knowledge not stored in the data system. Data is equated with the information it presents (McNurlin, Ralph, Sprague, & Bui, 2009). Recently, scholars have categorized Big Data into "data at rest" and "data in motion" (Ebberg, 2013). Data at rest, or traditional data, is static or inactive data containing values collected and stored in servers, computers, or databases to be analyzed later for decision making. It includes files, backup tapes, tables, patient records, etc. On the other hand, data in motion, data in transit, or data in use is dynamic data processed by analyzing it on the fly, in real time, in the network or in cloud servers, without storing it on the hosts. Data in motion may flow over a public or untrusted network, e.g., the Internet, or within a confined private network, for example, an Intranet, corporate LAN (Local Area Network), or WAN (Wide Area Network) (Moore, 2014). Data in motion are spontaneous values created on the fly; they can turn an event into insight at the moment the event occurs, and their greatest value is the moment of truth created at that point in time (Nixon, 2013). For example, the Annenberg Innovation Lab at the University of Southern California used IBM analytics, e.g., IBM InfoSphere Streams and IBM BigSheets, to uncover the insightful feelings (like or unlike; agree or disagree) of the target audience in real time during the 2008 presidential debate. The data comprise emails, the web, and Internet protocols. Processing data in motion instantaneously requires advanced analytical tools such as IBM InfoSphere Streams, Tableau, Hadoop, or Apache Spark (Wayteck, 2011).
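A minimal sketch of processing data in motion follows, using Spark Structured Streaming (one of the tools named above) in Python; the socket source on localhost:9999 is a stand-in for a real feed such as a social media stream.

```python
# A minimal sketch of analyzing "data in motion" on the fly with Spark
# Structured Streaming; the socket source is a stand-in for a real feed.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("data-in-motion").getOrCreate()

# Read an unbounded stream of text lines as they arrive.
lines = (spark.readStream.format("socket")
         .option("host", "localhost").option("port", 9999).load())

# Count words on the fly, without ever storing the raw stream on disk.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Continuously print updated counts as new data flows in.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```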
The enormous mountain of data with hidden treasures, generated by different devices, sensors, and applications, forces many organizations to focus on how to control, extract, transfer, and load Big Data (Carter, 2014). Big Data carries some typical benefits (AllSight, 2016). For example, marketing agencies and banking systems monitor data and information in social networks, e.g., Facebook and Twitter, to learn about the response to their promotions and campaigns (Capgemini, 2013). Product companies and retail organizations can plan their production by using information such as customer preferences and product perception (IBM Analytics, 2015). Hospitals and insurance agencies can provide better, higher-quality services based on data from patients' previous medical records (Nolan, 2015). Big Data technologies provide more accurate analyses that may lead to more concrete decision-making for cost reductions, operational efficiencies, and reduced risks. Most companies search for an alternative path to concrete decision making and a successful data-driven strategy in the competitive market, while academic and scientific communities seek to understand the business economy in depth. Industry leaders, academic scholars, and data scientists all hold high expectations that Big Data will propel society, with more wealth and prosperity, into a new frontier of innovations (Abhishek & Arvind, 2007).
Boyd and Crawford (2012) projected Big Data, on the rise, as an organizational development that employs the interactions between humans and technology. Big Data becomes a socio-technical phenomenon because it is a scheme of arrangement and a process of complex work design that employs the interaction between humans and technology in the workplace (Long, 2013). The socio-technical system refers to the interaction between complex infrastructures and human behaviors. It is about joint optimization, the interrelatedness of the social and technical aspects of an organization or of society as a whole (Trist & Bamforth, 1951).
Jacobs (2009) studied the pathologies of Big Data and found that it was difficult for humans to analyze stored unstructured data with very large spreadsheets and to extract data out of traditional database management systems. Using the pathology approach on Big Data by examining a sample of significant data did not work. Jacobs' study showed how difficult it is to handle Big Data due to the row-by-column limits of existing spreadsheets. Some differences between conventional data and Big Data are summarized in Table 2.1 below:
Table 2.1 The differences between Conventional Data and Big Data.
Source: The table was synthesized and built for this paper by this student (2016).
Scholars raise debatable questions, critically interrogating assumptions and biases about the socio-technical phenomenon of Big Data (Mayer-Schönberger, & Cukier, 2013). Six provocations to ignite conversations about Big Data involve its phenomenon, culture, scholarship, technology, analysis, and mythology. Many scholars and leading scientists have called for an international conference to discuss and learn about Big Data amid extensive utopian and dystopian bombast (Thomson, 2010). Since Big Data is a socio-technical phenomenon, it is worthy of a robust research study.
To narrow the scope of Big Data in the proposed research dissertation, data in healthcare was studied in three Internet-based, participatory, cloud, and mobile domains: (1) personal health information (PeHI), (2) clinical health information (CHI), and (3) public health information (PuHI) (Schneiderman, Plaisant, & Hesse, 2013). Statistics showed that unstructured data occupies about 70% of an organization's data assets (AllSight, 2016).
Personal health information comprises the records that healthcare providers and patients collect about their own health habits and practices. Monitoring the human body with sophisticated sensors enables physicians and nurses to understand the pros and cons of treatments (HIPAA Act, 1996). Based on personal health information such as patient medical activities, clinical health information such as electronic health records systems, and public health information such as public health data, data researchers focus on healthcare data and data analytics, which hold the promise of improving the quality of healthcare delivery and the potential to enhance patient care, save lives, and lower treatment costs. There are many advantages of applying big data to healthcare in clinical operations, research and development, public health, evidence-based medicine, genomic analytics, pre-adjudication fraud analysis, patient profile analytics, etc. (Raghupathi & Raghupathi, 2014).
Clinical health information comprises electronic health records systems that improve patient care and yield valuable insights into treatment patterns. With outcomes from data visualization, hospitals and universities continue to improve nursing and physician training programs. However, training physicians in what they should know is increasingly difficult because the large body of knowledge on specialized cases, various medications, and professional guidelines changes rapidly as new results emerge (Quora, 2014).
Public health information is the large amount of collected public health data that assists policy makers in making more reliable decisions, drawing on the US National Center for Health Statistics, Centers for Disease Control, Census, World Health Organization, etc. However, using this health information to derive insights remains a challenge (Agadish et al., 2014).
Digging into healthcare data, Raghupathi and Raghupathi (2014) discovered that BDA applied to Big Data in healthcare could make a significant impact on various fields of health care. The positive outcomes could include detecting diseases at earlier stages; managing individual and population health efficiently; detecting healthcare fraud more quickly; estimating, from large amounts of historical data, outcomes such as length of stay, suitability for elective surgery, and lack of benefit from surgery; identifying patients at risk of medical complications; identifying patients at risk of advancing disease stages; and pinpointing patients who are the greatest consumers of health resources. The promising results comprised identifying causal factors of illness progression; providing patients with information for making informed decisions; managing patients' own health; tracking healthier behaviors; identifying treatments; reducing re-admissions by addressing lifestyle factors that increase the risk of adverse events; improving outcomes by examining vitals from at-home health monitors; and managing population health by detecting vulnerabilities within the patient population during disease outbreaks.
For Big Data in healthcare, disease treatment and healthcare services are progressing, but at a slow pace. They do not keep up with the exponential spread of diseases and illness, especially among the elderly in society. Many diseases, such as AIDS, Alzheimer's disease, and various cancers, have no cure. The gap in modern treatment, efficient cure, and effective prevention still exists in healthcare services and health institutions.
Big Data Analytics
Today, society continues to change constantly, especially in technology such as Big Data Analytics (BDA). BDA is a systematic process of evaluating large amounts of data of varying types for the purpose of identifying hidden patterns, relationships among variables, unknown correlations, market trends, customer preferences, and other useful information such as diagnoses of illness or detection of fraud (Taylor, 2015). Data analytics is the process of transforming data into insightful information. Many advanced approaches, vigorous techniques, great models, and infrastructures are employed to retrieve the desired information. Recently, an emergent trend of BDA has become a popular demand in many fields: education, manufacturing, marketing, politics, healthcare, security, defense, and insurance. Demand for BDA provides plentiful employment opportunities in many organizations for big data talents who possess highly analytical skills (Sondergaard, 2015). However, the ability to extract information still encounters limitations in organizations (Snijders, Matzat, & Reips, 2012).
With the benefits of using Big Data Analytics to mine Big Data for insights, scholars and data scientists from Tutorials Point (2016) distinguished Big Data Analytics technologies into two categories: (1) Operational BDA and (2) Analytical BDA. Operational BDA technology includes NoSQL systems like Amazon DynamoDB, Cassandra, InfiniteGraph, and MongoDB that provide operational capabilities for real-time interactive workloads. Data are mostly captured and stored for cloud computing, similar to an RDBMS (relational database management system). On the other hand, Analytical BDA technology comprises systems such as Massively Parallel Processing (MPP) database systems, the MapReduce system, the Hadoop ecosystem, or Apache Spark that have analytical and statistical capabilities for extremely complex analyses. These two classes of BDA technology perform complementarily and are frequently deployed together, each enhancing the other; a small sketch follows the table below. Table 2.2 below shows a comparison between the two classes of technology in Big Data.
Table 2.2 shows a characteristic comparison between Operational and Analytical Systems.
Source: Adapted from Tutorials Point, 2016
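To make the contrast concrete, the sketch below pairs an operational workload (a single-record read/write against MongoDB via the pymongo driver) with an analytical one (a full-scan aggregate in Spark). It assumes pymongo and PySpark are installed; the host, database, file path, and field names are hypothetical.

```python
# A minimal sketch of the two BDA classes side by side; all names,
# hosts, and paths are hypothetical.
from pymongo import MongoClient
from pyspark.sql import SparkSession

# Operational BDA: a low-latency interactive workload on one record.
client = MongoClient("mongodb://localhost:27017")
patients = client["hospital"]["patients"]
patients.replace_one({"_id": "p-100"},
                     {"name": "Doe", "age": 64}, upsert=True)
print(patients.find_one({"_id": "p-100"}))

# Analytical BDA: a complex aggregate over the entire data set.
spark = SparkSession.builder.appName("analytical-bda").getOrCreate()
df = spark.read.json("hdfs:///warehouse/patients.json")
df.groupBy("diagnosis").avg("length_of_stay").show()
spark.stop()
```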
Handling Big Data with a traditional approach, such as centralized database server systems or an RDBMS (Relational Database Management System) with Microsoft Excel or Access, no longer works (Connolly, & Begg, 2014). To analyze Big Data of colossal volume and various forms, many organizations such as Amazon, Apple, Google, IBM, Intel, and Microsoft have developed their own high-tech statistical tools or used advanced analytical tools (Brandon, 2015). Some typical analytical tools are AWS (Amazon Web Services), Tableau, R, and Apache Spark (Machlis, 2011). In business applications, the range and strategic impact of BDA on Big Data are vast. Applications of BDA are used in many fields: healthcare, automotive, presidential campaigns, highway traffic, insurance, banking, social networking, and law enforcement (Natarajan, 2012).
BDA can be used in Internet search engines, e.g., Google or Yahoo, and social media networks, e.g., LinkedIn, Facebook, and Twitter, to collect, capture, and analyze online conversations, streaming data, and users' posts in order to learn human behavior or actions on their homepages. BDA can detect fraud and spam, improve website design, and explore advertisement opportunities (Clark, 2015). Beyond business, modern astronomical telescopes, genome sequencers, and physics particle accelerators generate vast amounts of data for BDA.
One example of BDA is Google Fusion Tables, a Web-based data management service used to gather, visualize, and share data tables. Data are captured and stored in multiple tables for viewing or download. It provides dataset visualization and mapping, and its platform is a browser such as Chrome or Netscape (Halevy, & Shapley, 2009). Data can be displayed visually in different forms such as bar charts, pie diagrams, line plots, timelines, scatter plots, or geographical maps, and exported in a comma-separated values file format. It has a skill level of 1, for users who have some basic spreadsheet knowledge, and it is free and easy to use. Another example of a BDA application is QlikView. QlikView offers simple drag-and-drop, self-service creation of data visualizations without writing many SQL query commands. QlikView can connect various databases from different vendors into QlikView's centralized repository. It has an intelligent indexing method to discover patterns and trends in new data of different types. QlikView provides dashboards to aid decision support systems. Its platform is 64-bit Windows, with a skill level of 2 (Qlik, 2015). QlikView accepts dynamic data type formats from any source into its in-memory analytics platform, and it has many channels of documentation for building big data applications quickly without disruption or downtime. Also, IBM Watson is a question-answering computing system for machine learning, information retrieval, knowledge presentation, and automated reasoning. It has the capability to find the correct answer after running a hundred algorithms of proven language analysis. IBM Watson's applications are often used in financial services, telecommunications, healthcare, government, and game contests such as Jeopardy (Thomson, 2010). Users are not required to know statistics because IBM Watson computes everything in the background. IBM Watson also provides browser-based visualization and analysis applications, with a skill level of 1. IBM Watson is an analytics tool that has the ability to retrieve major information from all documents and provide hidden patterns, insights, and correlations across vast data sets; an estimated 80% of data are unstructured, in various forms such as news articles, online posts, research papers, or organizational system data (Thomson, 2010). It is also offered as a free tool.
According to Herodotou, Lim, Luo, Borisov, Dong, Cetin, and Babu (2011), data scientists and leading scholars expect a primary breakthrough in such data from distributed and grid computing. As a result, many disciplines, e.g., engineering and applied science, have computing sub-branches in biology, economics, or even journalism. Data analytics, transforming data into insights, has gained popular demand as a new trend in corporate strategy in many organizations. Today, Big Data Analytics is a new corporate trend and a key to success in business (Herher, 2014).
Big Data poses enormous challenges. Many organizations, and particularly academic scholars and data scientists, encounter a great many barriers and difficulties in retrieving Big Data for insights, as discussed and explored in the literature. The massive volume of Big Data cannot be stored properly in traditional database systems such as RDBMSs (Relational Database Management Systems) (Sadalage, & Fowler, 2012). Unstructured data are generated from GPS, blogs, mobile data, PDF files, web log data, forums, website content, spreadsheets, photos, clickstream data, RSS feeds, word processing docs, satellite images, videos, and audio files (AllSight, 2016). Big Data analytics performed at lower cost and at the right time becomes a major factor of success in industry. The hardcore science disciplines have worked on volume, and perhaps velocity (Brown et al., 2011); however, a study of all 5 V's together is an exciting and sophisticated challenge. Based on the data, meaningful information, logical knowledge, and wisdom extracted from big data, researchers can build a theoretical framework of wisdom (Minelli, Chambers, & Dhiraj, 2013). Organizing and converting unstructured data into categories is an enigma and a headache for data scientists. Some typical challenges in Big Data retrieval are: (1) capturing data is difficult because of its massive size, (2) curation is not easy, (3) storage requires huge memory and disks, (4) sharing data is complicated because it is in various forms, (5) transferring data is time-consuming because of the huge volume, (6) analysis of data requires advanced analytical tools, and (7) the presentation of results is sophisticated and requires data visualization tools (Microsoft Power BI, 2016).
With data exploding toward 44 zettabytes by 2020 (Vizard, 2014), organizations have no choice: they have to use BDA to maximize computing power and accurate algorithms, on the prevalent belief that big data offer a profound form of intelligence and knowledge that can produce insights for a competitive edge.
Hadoop MapReduce and Apache Spark are two popular analytics software tools used by many companies. The two tools can complement each other and work together; for example, Spark can work on the Hadoop Distributed File System (HDFS). Spark applications can run up to 100 times faster than those run on Hadoop MapReduce because Spark uses RAM (in-memory computation) while Hadoop runs on hard disk. Hadoop MapReduce has more flexible, wide-ranging options, but Spark can convert a big chunk of data into actionable information faster. Trovit, a classified ads search engine, uses HDFS with many smaller servers to solve the storage problem posed by huge amounts of data. However, when using Hadoop MapReduce on HDFS, developers and users experience an inflexible application programming interface (API) and strictly on-disk activities. Apache Spark offers a flexible and fast distributed processing framework: developers can run MapReduce code in production on the Spark platform at high speed and with ease of use, as sketched below. For instance, the Trovit team built a set of libraries on top of the Spark framework for rapid processing suited to their resources. Today, Trovit uses the Hadoop and Spark combo for renewed flexibility in the language of the data and the ability to process in parallel for effectiveness and efficiency (Riggins, 2016).
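The point about running MapReduce-style code on Spark can be sketched briefly: the classic word-count map and reduce steps, expressed against Spark's RDD API in Python and executed in memory. The input path is hypothetical.

```python
# A minimal sketch of MapReduce-style code on Spark's RDD API: the same
# map and reduce steps a Hadoop job would use, executed in memory.
# The HDFS input path is hypothetical.
from pyspark import SparkContext

sc = SparkContext(appName="mapreduce-on-spark")

counts = (sc.textFile("hdfs:///data/ads.txt")
            .flatMap(lambda line: line.split())   # map: emit words
            .map(lambda word: (word, 1))          # map: key-value pairs
            .reduceByKey(lambda a, b: a + b))     # reduce: sum per key

print(counts.take(10))
sc.stop()
```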
In the rise of big data analytics, Microsoft acquired Revolution Analytics and IBM bought SPSS; SAS was founded at North Carolina State University, and the S language was created at Bell Labs. The R language is the next-generation innovation of the S language. R has a set of generic libraries for various applications such as econometrics and natural language processing. Some R libraries target industry issues and problems in clinical trials, genomics, genetics, insurance, education, finance, manufacturing, and healthcare. R supports desktop and server-based processing, and it also performs parallelized processing within Hadoop clusters, like Apache Spark, for data mining and data warehouses. R is expanding into other complex fields such as biology and genomics through statistics (Schmidt, 2014). Many organizations such as Google, YouTube, the FDA, The New York Times, and Facebook have used R for graphical and statistical computing on big data sets. Contributors in a fast-growing community promote R applications in Big Data processing and development in many organizations. Rapidly expanding, R appears to be taking over the market recently controlled by SAS and SPSS (Datacamp, 2014).
Many advanced approaches, vigorous techniques, great models, and infrastructures are employed to retrieve the desired information. However, the ability to extract information still encounters limitations (Snijders, Matzat, & Reips, 2012). Note that to tackle these challenges, major organizations often use enterprise servers in large-scale configurations at high cost.
With the rise of the cloud and of distributed and grid computing techniques, data scientists and information professionals play key roles in the assessment of Big Data. They all know that Big Data contains enormous algorithmic information, and the devil is in it (Herher, 2014). From medicine, security, and education to politics, organizations that have the capability to use Big Data for theories, scenarios, and assessment of past assumptions will gain a competitive edge. Big Data can extend the goals of life, liberty, and happiness (Lazari, 2016). However, the keys that open the chest of data lie in the hands of private companies and secret stores such as the NSA's, which always protect their asset property. Users do not have authorization to access the data repositories, and they cannot claim ownership of their own data (Resnik, 2011). People have no idea of the depth of data kept secretly by the government (Voosen, 2015). Accessing Big Data for statistical analysis or analytics in the open environment is still a challenge for information professionals (Laurila, Gatica-Perez, Aad, Bornet, Do, Dousse, & Miettinen, 2012).
Insights
Insights are the results or outcomes of data analytics work on big data. They are useful information, meaningful and valuable to organizations for many business purposes, such as assisting managers (1) to make sound and precise decisions in business, (2) to improve business performance, (3) to increase organizational productivity, and (4) to gain and sustain a competitive edge in the dynamic market, locally and globally.
In general, decision making is traditionally a participatory process in which several participants (about 5 to 10 persons) collect information, analyze problems or situations, weigh courses of action, and select the best solution for a problem in a wide range of business settings. The process used to arrive at decisions can be structured or unstructured. Time pressure and conflicting goals, often external contingencies, impact the development and effectiveness of decision-making groups. Group decision methods include (a) the Delphi technique built by the RAND Corporation about six decades ago (RAND Corporation, 1950), (b) dialectical inquiry, (c) brainstorming, and (d) the nominal group technique. In the 1970s, Sprague and Carson (1982) developed Decision Support Systems (DSSs) to aid decision makers confronting ill-structured problems. Other information-centric decision-making systems are Executive Information Systems (EISs), Expert Systems (ESs), agent-based modeling, and real-time CRM (Customer Relationship Management) (McNurlin, Sprague, & Bui, 2009). Business Intelligence (BI), which also facilitates corporate decision-making, consists of data mining, data warehousing, and OLAP (online analytical processing) (Connolly, & Begg, 2014). Other, indirect approaches to decision making are think tanks or reflection pools such as the Brookings Institution or the Heritage Foundation, traditional forecasting, and contemporary scenario planning (Daniel Research Group, 2011; Seemann, 2002; Wade, 2014). To date, decision-making technologies continue to evolve rapidly, from Big Data and Big Data Analytics to Artificial Intelligence and the IoT (Internet of Things). To gain a competitive edge in a dynamic, data-driven, Web-centric market, organizations use BDA to mine insights within Big Data for better decision making in a wide range of applications, from technology and economics to locality and globalization (Paulding, 2016).
From the intensive review of the literature, it seems that many scholarly authors and professional practitioners describe, identify, and discuss Big Data, a hot emerging trend in the data-centric world. However, apparently no one mentions a study of the science of Big Data, which could be called Big Data Science. Big Data Science could be a new educational branch like Computer Science, IS (Information Systems), or IT (Information Technology): a discipline that seeks to build a scientific foundation for such topics as Big Data Analytics, Big Data software, data/information/knowledge/wisdom processing, algorithmic solutions of data-related problems, and the algorithmic process itself. The gap is that there is no available framework or foundation for Big Data Science.
The topic of research is the evolution of wisdom, which extracts big data (D) into information (I), transforms it into knowledge (K), and then constructs wisdom (W). The DIKW evolution will establish a DIKW model whose foundation underlies Data Management (DM), Information Management (IM), Knowledge Management (KM), and Wisdom Management (WM) (Ahlemeyer-Stubbe, & Coleman, 2014). DM and IM have already been developed; however, KM and WM are relatively new and not addressed completely. These topics deal with knowledge and wisdom. Knowledge is information processed among individuals, from individuals to groups, or across groups, while Knowledge Management is a process of coordinating knowledge. Differently from information, knowledge is an understanding of customers or relationships, with a notion of an idea acquired by study, investigation, observation, or experience rather than based on assumptions or opinions (Ahlemeyer-Stubbe et al., 2014). If facts are about data and reporting is about information, then analytics is about knowledge. Knowledge is considered intellectual capital (IC), or at least part of IC, a valuable organizational asset that requires identifying, managing, sharing, and protecting for competitive advantage in the marketplace. The wisdom constructed from knowledge may be one of the most difficult subjects to deal with because both knowledge and wisdom are associated with humans, homo sapiens. Knowledge and wisdom extracted from Big Data are not yet thoroughly characterized, and it is not clear how to manage these intellectual assets effectively and efficiently in organizations. The gap is the need to determine and manage knowledge and wisdom effectively.
In the past five years, many Big Data Analytics software tools have become available along with the data explosion. Selecting the right BDA platform and appropriate software tool(s) becomes critical to any organization regarding advanced technology, implementation, deployment, ease of use, maintenance, training, customer support, and cost (Cohen, Dolan, Dunlap, Hellerstein, & Welton, 2009). According to Minelli, Chambers, and Dhiraj (2013), big data analytics (BDA) is a scientific and systematic process to evaluate massive volumes of data in various forms at high speed. BDA's objective is to identify specific patterns, insightful relationships, unknown correlations, and other meaningful information. Different BDA software tools are available in the industry; at least twenty-two tools for data visualization and analysis, such as Tableau, R Project, QlikView, Hadoop MapReduce, and Apache Spark, are available free of charge for users (Machlis, 2011). The gap here is that no universal, standardized data analytical software tool is available for Big Data professionals (Patrizio, 2014).
Graphical conceptual framework of Big Data retrieval
Today, the high-tech society is changing constantly and exponentially, especially in technology (Grant, 2016). Data analytics is a data process for retrieving insightful information from Big Data. Many advanced approaches, vigorous techniques, great models, and infrastructures are employed to retrieve the desired information (Jones, 2014). However, the ability to extract information still encounters biases and limitations (Snijders, Matzat, & Reips, 2012). Scholars, data scientists, and data analysts in many fields have identified more gaps of knowledge in using Big Data for insights, particularly in healthcare (Tutorials Point, 2016). For example, capturing, storing, and analyzing Big Data are more complex and limited due to its colossal volume and its wide variety of forms. Big Data has different structured, unstructured, and semi-structured types in many different forms such as texts, blogs, mobile data, Web log data, forums, audio and video files, images, etc. Control and maintenance of these Big Data forms are often beyond traditional methods (Jagadish, Gehrke, Labrinidis, Papakonstantinou, Patel, Ramakrishnan, & Shahabi, 2014). Digital curation, such as preservation, collection, and maintenance of numerous forms of Big Data, usually encounters bottleneck problems or frozen systems. Sharing Big Data among organizations in interconnected networks is complicated due to different formats and platforms (Lohr, 2012). Transferring data among computer hosts and servers is very time-consuming, can freeze the system, and is often interrupted. Performing analysis on Big Data requires advanced analytics tools and highly skilled professionals (Baroni, 2014). Also, the presentation of analytical results or outcomes is sophisticated for audiences who lack the technical background to understand it (Few, 2016). Results and outcomes of the analysis usually require data visualization tools such as Tableau or QlikView for display to the audience (Pandre, 2016).
With the advent of fourth-generation languages and personal computers in the 1990s, and new standard computing, automation, and Web technologies recently, data at low cost of storage and processing has become ample and ubiquitous (Cukier, 2015). With the data explosion and big data technologies in e-commerce, finance, insurance, healthcare, etc., data ubiquity drives the evolution of data and information into knowledge and wisdom at ultra-fast speed (Erickson & Rothberg, 2014). Managing information in business leads to the modern field of Business Intelligence (Anonymous, n.d.). Sharing knowledge among individuals and across groups leads to new disciplines such as data science, knowledge management, and content management (Birasnav, Goel & Rastogi, 2012; Nonaka, & Takeuchi, 1995). Therefore, a study of extracting and transforming Big Data, values or numbers without context, into meaningful statistical information for predicting the occurrence of events (e.g., earthquakes, disasters, healthcare DNA decoding, flu crises, etc.) is well worth researching, because it can achieve four objectives: (1) making practical and strategic decisions, (2) improving business performance, (3) increasing organizational productivity, and (4) gaining and sustaining the competitive edge in the dynamic market, locally and globally (Davis, 2016; Gartner, 2016). Particularly in healthcare, the study of retrieving insights from personal, clinical, and public health information will advance understanding of the behavior of genes, drugs, and proteins, which can then be used to design new medicines that benefit humans and animals (Goodfellow, Bengio, & Courville, 2016). Companies use Big Data Analytics for competitive advantage and their own survival. Some typical advantages from Big Data analyses in the evolution of DIKW are increasing business, improving operational efficiency, driving revenues, acquiring new customers, and winning more market share (Podesta, Pritzker, Montz, Holdren, & Zients, 2014).
Based on the research topic, the in-depth literature review, and the unknown gaps of knowledge, the conceptual framework for the proposed research dissertation is developed and described as follows. The graphic of the conceptual framework is created as a linear model in a Microsoft PowerPoint (MS pptx) document. It is based on a literature review of ninety-two (92) credible articles and many papers on the topics and subtopics. Each topic or subtopic is entered in a box. The topics and subtopics are displayed in thirty (30) topic boxes, arranged into seven logically ordered groups:
1. Group 1: This group explains the topic of "What is Big Data?", which describes, discusses, and explores Big Data and its challenges, difficulties, and obstacles in collecting, storing, analyzing, processing, etc. (George, Haas, & Pentland, 2014).
2. Group 2: This group discusses the topic of "Big Data and Research," covering the knowledge gap, trends, and research on Big Data. It highlights the capabilities of Big Data for gaining knowledge of the market, customers, and demands (Bughin, Chui, & Manyika, 2010).
3. Group 3: Group 3 addresses the topic of "Benchmark on Big Data," which establishes a standardization of Big Data, such as the underlying business benchmark, data model, and synthetic data generator, focusing on the variety, velocity, and volume aspects of big data systems (Baru, Bhandarkar, Nambiar, Poess, & Rabl, 2012).
4. Group 4: Group 4 focuses on the topic of "Applications of Big Data" in various fields and areas such as mobile computing, life science, instruments, genomics, healthcare, government, etc. (Costa, 2012).
5. Group 5: Group 5 covers the topic of "Security and Ethics of Big Data," which addresses codes of conduct, data security, information ownership, human privacy, human subjects, and related risks (CITI Program, 2015).
6. Group 6: Group 6 constructs the topic of "Knowledge Management (KM) and its Applications," which establishes a framework, knowledge innovation, and KM applications. Data is intellectual capital that is valuable enough to be identified, managed, and protected, perhaps granting a competitive advantage in the marketplace (Tsai, 2013).
7. Group 7: This group provides a guide for novice researchers on research methodology (Ellis, & Levy, 2009).
In the Microsoft .pptx graphic slides, a straight line is drawn from the origin at the lower left corner up to the upper right corner. All thirty topic boxes are arranged along this straight line into the seven groups of topics and subtopics used in drafting the literature review. The straight line represents the linear conceptual framework model that consists of the seven topic groups. The graphic of the conceptual framework has four slides in sequential order:
a. The first slide displays four topic groups, 1, 2, 3, and 7, on the linear line.
b. The second slide displays three topic groups, 4, 5, and 6, on the extended line.
c. The third slide shows the staircase figure that includes (1) General problem, (2) Specific problem, (3) Purpose, and (4) Research question, oriented toward a Qualitative methodology (Ql) (Bryman, & Bell, 2011).
d. The fourth slide shows a staircase figure similar to the third slide, but the research question is oriented toward a Quantitative methodology (Qn) (Creswell, 2014).
Figure 2.1: A linear model is shown in the graphic of the conceptual framework of the literature review on Big Data Analytics research.
Slide 1: The first four topic groups, 1, 2, 3, and 7, focus on Big Data and its applications (Brown, Chui, & Manyika, 2011; Xian, & Madhavan, 2014).
Slide 2: The last three topic groups, 4, 5, and 6, address Big Data, security, ethics, knowledge, and knowledge management (Alavi, & Leidner, 2001; Albescu, Pugna, & Paraschiv, 2009; Birasnav, Goel, & Rastogi, 2012).
Slide 3: A conceptual framework for a Qualitative method that includes the general problem, specific problem, purpose of the research study, and research question (Alasuutari, 2010).
Note that for the Ql design method, the qualitative interview strategies will use one of four distinct types of Ql interviews: focus groups; online Internet interviews; casual conversations and in-passing clarifications; and semi-structured and unstructured interviews (Rubin, & Rubin, 2011). Two additional interview strategies proposed by other scholars are in-depth interviews and projective technique interviews (Hargiss, 2015). These Ql interview strategies differ from each other based on the interviewer's role, the interviewees' participation, and the relationship between interviewer and interviewees. Semi-structured and unstructured interviews can be categorized as in-depth interviews, described above. Both semi-structured and unstructured interviews are extended conversations between an interviewer and an interviewee (Rubin & Rubin, 2011). In semi-structured interviews, a researcher has a specific topic to learn in depth, plans questions in advance, and asks follow-up questions within a narrow scope. In unstructured interviews, a researcher has a general topic in mind, but specific questions may be generated as the interview proceeds, in response to what the interviewee says, within a generic scope. The purpose is similar to that of in-depth interviews. In in-depth interviews, the researcher, as a well-trained interviewer, conducts a face-to-face interview with a participant. A set of probing questions is provided to the interviewee, and the interviewer encourages the interviewee to express a point of view in the larger scope. The purpose is to collect as much memory, attitudinal, and behavioral data from the interviewee as possible (DiCicco-Bloom & Crabtree, 2006).
The target population consists of two groups. Group 1 includes graduate and doctoral students who study Big Data and concentrate on data analytics at CTU or other universities; the estimated size of Group 1 is ten (10) students. Group 2 consists of professionals and analysts who work with Big Data generated in a variety of fields such as e-commerce and market intelligence, e-government and politics, science and technology, smart health and well-being, and security and public safety; the estimated size of Group 2 is another ten (10) participants. The total estimate of participants who can present a variety of views, and who are willing to talk to the interviewer, is twenty (20) professionals.
This student researcher purposefully selects participants and sites as sources of data in the qualitative research. Qualitative data comes from many sources, for example, field notes, existing documents, interviews, and audio and video tapes. The interview process is one of three stages (i.e., interview process, observational process, and artifact review) in the qualitative strategies. Qualitative data collection strategies primarily include interview techniques and open-ended items on a survey.
Purposive sampling selects graduate and doctoral students from the academic community and professional analysts from business and industry because they can purposefully inform an understanding of the research problem and the central phenomenon. The purposive sampling plan is a non-probability sampling strategy in which participants are selected based on predetermined criteria such as their knowledge, understanding, and experience of the Big Data topic. The purposive sampling is also based on these participants' relevance to the research questions.
Slide 4: A conceptual
framework for a Quantitative method (DeVault, 2015).
In quantitative research, the Qn data collection strategy relies on survey techniques. A survey is performed on a population defined by the research objectives of a study. The population may be tangible or abstract. Statistical inference is made about a population based on data from a sample, where the sample is a representative subset of the population (Gall, Borg, & Gall, 2013). Samples are of two types: probability and non-probability. In probability sampling, each member of the tangible population must be known before sampling occurs, and each member must have an equal chance of being in the sample. Non-probability sampling can be used with either a tangible or an abstract population.
Data play a vital role in descriptive statistics. The form of the data is numbers, in data collection and particularly in Qn statistical analysis. According to Field (2015), the secret of life about people is hidden in numbers, and discovering or revealing it requires large-scale analysis. Researchers use data in the form of numbers or values to represent the people, organizations, or subjects under study. Researchers may get lost in the numbers: because they constantly deal with data and information in the form of numbers or values every day, they may lose sight of the research study's objectives, goals, or purpose. They forget that the research objective is to contribute to the knowledge pool, to improve human life, or to provide more benefit to people, organizations, and the environment (Huck, 2015). For example, people who watch the weather forecast, football games, the stock market, etc. all see numbers such as ambient temperatures, game scores, and stock values.
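As a small illustration of the descriptive statistics discussed here, the sketch below summarizes an invented set of temperature readings with Python's standard statistics module.

```python
# A minimal sketch of descriptive statistics on numeric data;
# the temperature readings are invented for illustration.
import statistics

temperatures = [71, 68, 75, 73, 70, 69, 74, 72]

print("mean:  ", statistics.mean(temperatures))
print("median:", statistics.median(temperatures))
print("stdev: ", round(statistics.stdev(temperatures), 2))
```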
Note that the conceptual framework depicts both Ql and Qn methods, but the proposed research dissertation will perform only one of them due to the heavy workload and time constraints of the third academic year at CTU.
In summary, the proposed research topic is "A research study of extracting and transforming Big Data, particularly huge healthcare data sets, as values without context, into meaningful information such as hidden common pattern correlations and the frequent, predictable occurrence of events that benefit humans in the evolution of DIKW (Data, Information, Knowledge and Wisdom) (Ahlemeyer-Stubbe, & Coleman, 2014)." This chapter provided a review of the literature that includes an introduction to big data retrieval for insightful information to assist decision-making in business. It addressed the theoretical literature, which consists of recent and seminal literature. The literature review included contextual literature comprising the most recent journals, credible articles, and scholarly periodicals in three sections focusing on (1) what big data is, (2) the discipline of big data analytics, and (3) meaningful insights. It mined deeper into both the theoretical and contextual literature to find several gaps concerning a science of big data, the DIKW evolution, and a universal standardized big data analytics software tool. The gaps of knowledge in big data and data analytics include (1) a new discipline of big data science, (2) DIKW evolution toward knowledge management and content management, (3) enhancement of disease treatment and improvement of healthcare services, and (4) universal analytics tools for processing data effectively, such as capturing, storing, and analyzing various large unstructured and semi-structured data sets, along with other issues like data sharing, transfer, analysis, curation, and result presentation. The chapter also presents the conceptual framework based on the topic as a linear model: a straight line with seven topic groups drawn from one hundred twenty-two (122) references, and a descending staircase of the general research problem, specific problem, research purpose, and especially the research questions, which can lead to selecting either a qualitative or a quantitative methodology. Note that since the research study is very time-consuming, with a heavy workload, only one methodology will be selected for conducting the research study at CTU (Colorado Technical University). Chapter Three will discuss the method design for the research study of Big Data Analytics' impact on Big Data.