Monday, September 26, 2016

Design Proposal on the Cyber-Healthcare System

by TSL
September 10, 2016

In general, a design proposal is a descriptive document of a certain product, prepared by architects or a design team, that provides overall guidance on the project's architecture to the development team; it is also a primary informative document of the design for organizational management (IEEE, 2009). The product can be a hardware instrument, a software package, or a system. The design document may contain many sections and sub-sections, including a general overview, goals, flow charts, diagrams, specifications, assumptions, and limitations (CMS.gov, 2005). In this Unit 5 Individual Project (CS882 U5 IP), a newly hired big data analyst presents a design proposal that uses the Hadoop ecosystem to analyze various large structured and unstructured datasets for insights in a chain of state-of-the-art hospitals and other health services in four states (Arizona, Colorado, New Mexico, and Utah) in the US. A Hadoop solution will address specific business problems in the healthcare field. The design proposal consists of eight sections as follows:
I. Introduction
II. Specific requirement design
III. Data flow diagrams
IV. Overall system diagrams
V. Communication Flow Chart
VI. Regulations, policies, and governance for the medical industry
VII. Assumptions and limitations
VIII. Justification

I. Introduction
This document provides a design proposal for a Hadoop solution applied to the business problem of analyzing various large data sets in a hospital network system in the Four Corners region, i.e., Arizona, New Mexico, Colorado, and Utah. The design proposal of the Cyber-Healthcare System gives corporate management descriptive information about, and guidance on, the architecture of the project, which solves the business problems of analyzing huge sets of scattered, complex data in the healthcare system. The document also provides related readers a generic, informative overview of the project.
With cloud computing, automation, and Web technologies such as artificial intelligence and the Internet of Things (IoT) in the competitive Internet-centric market and data-driven economy, big data at low storage cost, particularly complex data in the healthcare area, has exploded and become available almost everywhere. Analyzing mostly unstructured data at colossal volume, such as personal health records, clinical health information, or public health data, for insights usually poses practical challenges and real business problems to organizations such as hospitals and clinics in the area. For example, presenting the analytical results of a real-time analysis is an issue for corporate management.

II. Specific requirement design
The design proposal of the Cyber-Healthcare System (The System) provides an overall description of the data analytics solution to the challenges and business problems in the healthcare field, with the following requirements:
     A. Large data sets
The broad and complex data sets processed in the Cyber-Healthcare System, which include both structured and unstructured data, will be stored in a reliable, centralized online repository. Data sets can be replicated and shared among nodes in scalable distributed clusters. A backup system will provide a safeguard and recovery in the event of disasters such as hacking or data loss. Insightful information extracted by the Hadoop system will be divided into three categories (Schneiderman, Plaisant, & Hesse, 2013):
          1. Personal health information:
Physicians and patients collect information about their practices and their own health habits.
          2. Clinical health information
Electronic health records systems can enhance health care and treatment for patients and yield useful insights into practical patterns of treatment.
          3. Public health information
A large quantity of public health data is collected to assist policy makers in more reliable decisions.

Figure 1: Big data in the healthcare area includes structured data and unstructured data, with unstructured data making up about two-thirds of the data pie.

Source: Adapted from AllSight (2016).

     B. Hadoop ecosystem 
To extract and transform the complex, huge data sets of healthcare for insightful information, the Cyber-Healthcare System will implement and deploy the Hadoop ecosystem to hospitals in the area. As the de facto standard to manage big data, Apache Hadoop - an open source Java-based framework that uses parallel data processing across distributed clusters - is chosen for this project (Apache Software Foundation, 2014). A simplified Hadoop architecture includes four major components:
          1. Hadoop Common
The component consists of Java libraries and utilities to support other components.
          2. Hadoop YARN
The component does job schedules and manages cluster resources.     
          3. HDFS (Hadoop Distributed File System)
The HDFS provides high-throughput access to application data.
          4. Hadoop Map/Reduce
It performs Map and Reduce functions on large data sets in parallel processing to retrieve insightful health information for patients, clinics, and hospitals.
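The Map and Reduce phases described above can be sketched in plain Python as a toy simulation (not actual Hadoop code); the sample records and illness names are illustrative assumptions:

```python
from collections import defaultdict

# Toy simulation of the Map/Reduce flow on sample clinical records.
# The record format and illness names are hypothetical examples.
records = [
    "AZ,Hospital-XXX,Vertigo",
    "CO,Hospital-III,Diabetes",
    "AZ,Clinic-YYY,Vertigo",
    "NM,Hospital-MMM,Diabetes",
]

def mapper(record):
    """Map phase: emit an (illness, 1) pair for each record."""
    state, facility, illness = record.split(",")
    yield (illness, 1)

def reducer(key, values):
    """Reduce phase: sum the counts for one illness."""
    return (key, sum(values))

# Shuffle step: group intermediate pairs by key, as Hadoop does
# between the Map and Reduce phases.
groups = defaultdict(list)
for record in records:
    for key, value in mapper(record):
        groups[key].append(value)

results = dict(reducer(k, vs) for k, vs in groups.items())
print(results)  # {'Vertigo': 2, 'Diabetes': 2}
```

In real Hadoop, the mapper and reducer would run as distributed tasks over HDFS blocks; the logic per record is the same.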

Figure 2 shows a simplified Hadoop framework with four components: YARN Frameworks, Common Utilities, HDFS, and Map/Reduce Computation.
Source: Adapted from Apache Software Foundation, 2012.

Figure 3 displays a high-level HDFS architecture with name node and multiple data nodes in data processing.
Source: Adapted from Borthakur (Apache Hadoop Organization, 2012).

                Healthcare tools in the Cyber-Healthcare System are designed to assist health authorities with long-term plans, business strategies, and healthcare policies. The healthcare tools include diagnostic tools for monitoring, evaluating, and assessing. Other tools are used to support priority scheduling, identify effective strategies, evaluate costs, plan resources, calculate budgets, and program and implement tasks (WHO, n.d.).
     C. External interfaces
            The Cyber-Healthcare System will allow related users, such as nurses and physicians, to enter data or to view and search health information in the Hadoop ecosystem. Patients can access and view their own health records only. Administrators and designers, however, have privileges and authorization for options such as read, write, delete, save, and change.
          1. User interface
            There is one graphic user interface (GUI) for three types of users.
               - Patients have basic privileges on the GUI: they can read, view, and print out their individual health records and information. They can also schedule appointments, send emails with questions, etc.
                - Nurses, physicians, and data entry workers are given more privileges on the GUI, such as reading and viewing health information, entering data, and searching or querying for useful information.
                - Administrators and designers have full privileges on the GUI, such as read, write, delete, change, query, and extract data.
          2. Hardware interface and software interface
            The Cyber-Healthcare System, with a backbone of the Hadoop environment, supports NoSQL databases, aggregate data models, and key-value databases to perform the map-reduce computing and to store the results of the mappers and the reducers in materialized views with high fault tolerance. Users can use industry-standard formats such as XML, JSON, and plain text for complex data.
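As a minimal sketch of the key-value storage described above (the store layout, field names, and IDs are illustrative assumptions, not the System's actual schema):

```python
import json

# Minimal key-value sketch: patient records keyed by patient ID and
# serialized as JSON documents. Field names are illustrative only.
store = {}

def put(patient_id, record):
    """Store one record as a JSON document under its key."""
    store[patient_id] = json.dumps(record)

def get(patient_id):
    """Retrieve and deserialize one record by its key."""
    return json.loads(store[patient_id])

put("7742661926", {"name": "StevenConte", "illness": "Vertigo", "state": "AZ"})
record = get("7742661926")
print(record["illness"])  # Vertigo
```

A production NoSQL store (e.g., HBase in the Hadoop ecosystem) distributes such key-value pairs across the cluster; the access pattern is the same.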
            The hardware interface comprises personal computers, desktops, laptops, Smartphones, iPhones, iPads, etc. (Natarajan, 2012).
            Software interface includes:
               a. Platforms:
                    - OS Windows 7, 8, 8.1, 10 (32-bit, 64-bit)
                                       Windows Server 2008 (64-bit)
                                       Windows Server 2012 (64-bit)
                                       Windows Server 2012 R2 (64-bit)
                                       Windows Vista SP1 and later (32-bit and 64-bit)
                     - Mac OS X hosts (64-bit)
                                        Mavericks: 10.9
                                        Yosemite: 10.10
                                         El Capitan: 10.11
                     - Linux hosts (32-bit or 64-bit)
                                        Ubuntu 10.04 to 16.04
                                        Debian GNU/Linux 6.0 (“Squeeze”) and 8.0 (“Jessie”)
                                        Oracle Enterprise Linux 5, Oracle Linux 6 and 7
                                        Redhat Enterprise Linux 5, 6 and 7
                                        Fedora Core / Fedora 6 to 24
                                        Gentoo Linux
                                        openSUSE 11.4 to 13.2
                     - Solaris hosts (64-bit only)
                                        Solaris 11
                                        Solaris 10 (U10 and higher)                                      
               b. Emulated hardware
                     - Input devices: Standard PS/2 keyboards and mouse
                     - Graphics: Standard VGA devices
                     - Storage: Intel PIIX3/PIIX4 chips, the SATA (AHCI) interface, and two SCSI adapters (LSI Logic and BusLogic)
                     - Networking: Linux kernels version 2.6.25 or later
                                            Windows 2000, XP and Vista, drivers
                     - USB: xHCI, EHCI, and OHCI
          3. Nonfunctional requirements
                a. Security:
                        General security principles are applied:
                        - Keep software updated
                        - Safeguard network access to high-priority services
                        - Obey the least-privilege principle
                        - Monitor system activity
                        - Stay current with the latest security information
                b. Performance
                        - Account for poor performance caused by host power management
                        - Account for performance variation with CPU frequency boosting
                c. Policy
                        The Cyber-Healthcare System works in harmony based on inputs from trained and motivated health workers. It is designed on a well-organized infrastructure and a stable supply of technologies and medicines, supported by adequate funding, strong health plans, and sensible policies (WHO, n.d.).
                d. Business rules
                         Business rules explain the definitions, operations, and constraints that apply in the Cyber-Healthcare System. Users of the Cyber-Healthcare System are required to follow all the rules in the agreement they sign when they sign up for or join the System.

III. Data flow diagrams
            Healthcare data is processed in parallel in both traditional data-warehouse databases and the Hadoop system during the ETL (Extract, Transform, and Load) process, as shown in Figures 4 and 5 below:

            Figure 4: Process flow of the large healthcare data in Map/Reduce functions in Hadoop system.
Source: Adapted from Intel, 2016.


            Figure 5: Data flow in both traditional data warehouse and Hadoop subsystem in parallelism in the Cyber-Healthcare System. Data science means data analytics that is a process of data analysis for retrieval of insights. 
Source: Adapted from Intel, 2016.


            In the Cyber-Healthcare System, the data flow can be described in an XML document. For example, a typical XML design document is written as follows:

     <?xml version="1.0"?>
     <!-- File name: TheCyberHealthcareSystem.xml -->
          <Group>
               <Groupname>Arizona</Groupname>
                    <Hospital>XXX</Hospital>

                         <DeptInternalMedicine>AAA</DeptInternalMedicine>
                         ….
                         <DeptIntensiveCare>BBB</DeptIntensiveCare>
                         …..
                         <DeptFamilyCare>BBB</DeptFamilyCare>
                         ……..
                    ….
                    <Clinic>YYY</Clinic>
                    …..
                    <Nursinghome>ZZZ</Nursinghome>
                    ……
               <Groupname>Colorado</Groupname>
                    <Hospital>III</Hospital>
                         <DeptInternalMedicine>OOO</DeptInternalMedicine>
                         ….
                         <DeptIntensiveCare>PPP</DeptIntensiveCare>
                         …..
                         <DeptFamilyCare>QQQ</DeptFamilyCare>
                         ……..
                    ….
                    <Clinic>JJJ</Clinic>
                    …..
                    <Nursinghome>KKK</Nursinghome>
               …..
          </Group>

     <Patient>
          <Patientname>StevenConte</Patientname>
               <PatientID>7742661926</PatientID>
               <PatientDOB>04301975</PatientDOB>
               <PatientAddress>XYZ</PatientAddress>
               <PatientOccupation>zyx</PatientOccupation>
               <PatientAge>41</PatientAge>
               <PatientHeight>5ft8Inch</PatientHeight>
               <PatientWeight>150</PatientWeight>
               <PatientIllness>Vertigo</PatientIllness>
               …………
     </Patient>

     ………..
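A record such as the Patient element above can be processed programmatically. The sketch below parses a small, well-formed copy of that record with Python's standard xml.etree.ElementTree module; the embedded sample data mirrors the illustrative fields above, not real patient data:

```python
import xml.etree.ElementTree as ET

# A well-formed copy of the sample <Patient> record (illustrative data).
xml_doc = """<?xml version="1.0"?>
<Patient>
    <Patientname>StevenConte</Patientname>
    <PatientID>7742661926</PatientID>
    <PatientDOB>04301975</PatientDOB>
    <PatientIllness>Vertigo</PatientIllness>
</Patient>"""

# Parse the document and collect each child element's tag and text.
root = ET.fromstring(xml_doc)
patient = {child.tag: child.text for child in root}
print(patient["Patientname"], patient["PatientIllness"])  # StevenConte Vertigo
```

The same approach extends to the Group/Hospital hierarchy: each nested element becomes a navigable node in the parsed tree.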

IV. Overall system diagrams
            The overall Cyber-Healthcare System, based upon a Hadoop ecosystem, consists of Hadoop YARN, the Common Utilities unit, HDFS, and Hadoop Map/Reduce. Hadoop YARN is a communication and control unit that provides job scheduling and cluster resource management. The Common Utilities unit is a supportive unit that provides libraries and utilities. HDFS provides access to the health data sets. Hadoop MapReduce applies parallel processing to the large healthcare data sets effectively. The large data sets, at terabyte scale, are broken into 64 MB or 128 MB blocks and stored on multiple low-cost commodity nodes in HDFS for the Map and Reduce functions to retrieve insightful information for end-users such as patients and frontline care providers (e.g., nurses, physicians, healthcare technologists, etc.).
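A back-of-the-envelope sketch of the block arithmetic, assuming an illustrative 1 TB data set, a 128 MB block size, and HDFS's default replication factor of 3:

```python
# HDFS block arithmetic for a terabyte-scale data set.
# The 1 TB dataset size is an illustrative assumption.
block_size_mb = 128                       # configurable HDFS block size
dataset_mb = 1 * 1024 * 1024              # 1 TB expressed in MB
blocks = -(-dataset_mb // block_size_mb)  # ceiling division
replication = 3                           # HDFS default replication factor
stored_mb = dataset_mb * replication      # raw storage across the cluster

print(blocks)      # 8192 blocks of 128 MB
print(stored_mb)   # 3145728 MB (3 TB) of raw storage with 3x replication
```

Each block is an independent unit of work, which is what lets the Map tasks run in parallel across the commodity nodes.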

            Figure 6 shows the overall central Hadoop System with the control unit Hadoop YARN and supportive unit Common Utilities in the Cyber-Healthcare System.


Source: Created by TSL, 2016

V. Communication Flow Chart
In communication, the Cyber-Healthcare System consists of the central Hadoop ecosystem connected to four groups, i.e., Arizona, Colorado, New Mexico, and Utah, in a star configuration, as shown in Figure 7 below. Each group is linked to local hospitals, outpatient clinics, nursing homes, and rehabilitation centers. Each organization has many departments. Each department has its own care team of frontline care providers, including physicians, nurses, and family members, who provide health care services to patients. Also, the environment group, which comprises regulators, Medicare, Medicaid, insurance companies, healthcare purchasers, and research funders, can communicate with institutions such as hospitals, clinics, nursing homes, and rehabilitation centers.

Figure 7 depicts a high-level communication flow chart among agencies in the Cyber-Healthcare System.
Source: Created by TSL, 2016

VI. Regulations, policies, and governance
            The Cyber-Healthcare System complies with all regulations, policies, and governance for the medical industry in its design as follows:
     1. Regulations
In practice, market research and ethics in healthcare data based on Internet technology are usually at odds with each other. Big Data Analytics (BDA) presents both technical and strategic capabilities for organizations to generate value from the data they store. With the rise of BI (Business Intelligence) and BDA, there will be more security violations and privacy issues (Quora, 2014). There is a prominent risk of violation of personal privacy. For example, terrorists may hack healthcare systems such as the Cyber-Healthcare System to sabotage the system, harm people, and advance their own ideology, politics, or religion. The System considers these issues seriously and uses the latest antivirus software, firewalls, etc. to protect the integrity of data and safeguard patients' information. The System complies with the government's in-depth regulations and obeys all medical rules. Note that the Cyber-Healthcare System will work to obtain ISO 9001 certification in the healthcare industry (Nolan, 2015).
     2. Privacy Policies
            Information about users' use of the website is collected by using a tracking cookie and server access logs. The collected information includes the following:
          a. The IP address from which the user accesses the website.
          b. The type of operating system (OS) and browser the user uses to access the System site.
          c. The date and time the user accesses the Cyber-Healthcare System site.
          d. The HTML pages the user visits.
          e. The page addresses from which the user followed a link to the System site.
            Some of this information is gathered by a tracking cookie set by the Hadoop Analytics or Google Analytics service under the privacy policy. Users may refer to their browser documentation for instructions on how to disable the cookie if they do not want to share the data with Hadoop or Google.
            The Cyber-Healthcare System gathers information to make the website more useful and friendly to visitors and to better understand how and when the website is browsed. The Cyber-Healthcare System does not collect or track personally identifiable information, or associate gathered data with any personally identifying information from other sources.
            By using this website, users consent to the collection of this data in the manner and for the purpose of solving the challenges and business problems in the healthcare field (Hadoop.apache.org, n.d.).
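Items (a) through (e) above correspond to fields of a standard Web server access log. A minimal sketch in Python, assuming the common Apache combined log format (the sample log line and pattern are illustrative):

```python
import re

# Sample line in Apache combined log format (illustrative data only).
log_line = ('203.0.113.7 - - [08/Sep/2016:14:21:05 -0700] '
            '"GET /records/view.html HTTP/1.1" 200 5120 '
            '"http://example.org/portal.html" "Mozilla/5.0 (Windows NT 10.0)"')

# Pattern for the combined log format: IP, timestamp, request line,
# status, size, referrer, and user agent.
pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<page>\S+) [^"]*" \d+ \d+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

fields = pattern.match(log_line).groupdict()
print(fields["ip"])        # item a: visitor IP address
print(fields["agent"])     # item b: OS and browser (user agent)
print(fields["time"])      # item c: date and time of access
print(fields["page"])      # item d: page visited
print(fields["referrer"])  # item e: page the visitor followed a link from
```

Logs in this shape are exactly the kind of semi-structured text Hadoop jobs can aggregate at scale.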
     3. Governance
HIPAA is the federal Health Insurance Portability and Accountability Act of 1996. It was designed to safeguard healthcare information, help people retain health insurance, and help control administrative costs in the healthcare industry (HIPAA, 1996). On the privacy issue, HIPAA emphasizes the protection and maintenance of personal health information in all health-related organizations. HIPAA requires that (1) frontline providers (e.g., physicians, nurses, etc.), (2) medical producers (e.g., pharmaceutical and medical device companies), and (3) payers (e.g., insurance companies) comply with all the laws and rules in governance.
            The Cyber-Healthcare System complies with all HIPAA governance rules.

VII. Assumptions and limitations
            The Cyber-Healthcare System is developed and designed based on the following assumptions and limitations:
     1. Assumptions (Flower, 1999):
          - The System’s clients are patients.
          - The System’s contact with patients is high intensity, low touch.
          - Doctors are independent carriers of information and judgment.
          - Healthcare is event-driven.
          - Much of ill health will be predictable and preventable. 
          - Patients will be partners in managing their health.
          - Data in the System’s centralized repository is assumed clean, reliable, and credible.
          - All institutions such as hospitals, clinics, nursing homes, etc. use the same platform to access, view, query, and enter the large data sets in the centralized repository.
          - All frontline care providers in the care team are trained to use the System properly and professionally.
          - The System keeps all sources of time visible to the guest synchronized to a single time source, the monotonic host time.
     2. Limitations (Hortonworks, 2016)
          - Some experimental features are beta (labeled as experimental). Such beta features are provided but are not formally supported. However, users’ suggestions and feedback are welcome.
          - Poor performance with 32-bit AMD CPUs may affect Windows and Solaris platforms.
          - Poor performance with 32-bit Intel CPU models affects mainly Windows, Solaris, and the Linux kernel.
          - NX (no-execute, data execution prevention) only works on 64-bit OS computers.
          - Windows XP has slower transmission rates because it supports segmentation offloading.
          - Shared folders are not supported on the OS/2 computers.

VIII. Justification
            The Cyber-Healthcare System is a modern state-of-the-art system in the contemporary network of hospitals, outpatient clinics, nursing homes, and rehabilitation centers in the 4-state region. The System is developed to eliminate isolation among hospitals, reduce inefficiency in care management, and prevent a loss of opportunities for advancing patient treatments. The Cyber-Healthcare System is designed with the following justifications:   
     1. Centralizing the scattered sources of colossal data sets from many agencies, various hospitals, and clinics.    
     2. Transforming unreliable huge data sets with duplication and redundancy in data and information to credible and reliable data sets.
     3. Establishing a large healthcare network system in the region to allow users such as patients and frontline care providers (physicians, nurses, family members), with different privileges, to access, view, and search information that is needed or required for patient treatments and cures at the lowest cost possible.
     4. The Cyber-Healthcare System is implemented in Hadoop environment as described in Section II.B Hadoop Ecosystem above.  
     5. The System’s architecture is developed based on four target elements:
          a. Patients.
          b. Care team consists of physicians, nurses, family members.
          c. Organization includes infrastructures, resources such as hospitals, clinics, nursing homes and rehabilitation centers.
          d. The environment comprises regulation, policy, and market like regulators, Medicare, Medicaid, insurance companies, healthcare purchaser, research funders, etc.
     6. The System is designed to tackle the huge data sets’ challenges in the healthcare industry. Some healthcare data challenges are:
          a. Capturing data is difficult.
          b. Curation is not easy.
          c. Storage requires huge memory, disks.
          d. Sharing data is complicated. 
          e. Transferring data takes a long time because of its huge size.
          f. Analysis of data requires advanced analytical tools.
          g. The presentation is sophisticated.
     7. Organizations in the System can provide better and high-quality services based on historical data from previous medical records of patients.
     8. The System has data visualization feature for users to access (Schneiderman, Plaisant, & Hesse, 2013):
          - Personal health information
          - Clinical health information
          - Public health information

IX. Summary
            This Unit 5 Individual Project document presented a design proposal for the Cyber-Healthcare System, which uses the Hadoop environment to process and analyze huge healthcare data sets in the Four Corners area, i.e., Arizona, Colorado, New Mexico, and Utah. The proposal included eight sections as follows:
I. Introduction
This section provides a quick overview of the Cyber-Healthcare System.
II. Specific requirement design
            The section explains an overall description and external interface requirements, such as the user interface (GUI), hardware interface (computers, laptops, iPads, smartphones, etc.), software interface (OS, platforms, etc.), communication interface (if applicable), and nonfunctional requirements such as security, performance, policy, and business rules, at a high level. The large and complex healthcare data, the Hadoop ecosystem with its various platforms, and other features are explained in detail.
III. Data flow diagrams
            This section describes the data flow, the flow of communication, and data processing in parallel MapReduce functions, displayed in several labeled diagrams and a typical XML code sample.
IV. Overall system diagrams
The section discusses an overall system design with the communication and control unit such as Hadoop YARN and HDFS modules.
V. Communication Flow Chart
The simplified communication flow chart used to connect the four geographical states is displayed in a star configuration with a high-level description.
VI. Regulations, policies, and governance for the medical industry
            This section provides regulations, policies, and governance in HIPAA for the medical industry considered in the Cyber-Healthcare System.
VII. Assumptions and limitations
            Several assumptions and limitations applied in the design of the
Cyber-Healthcare System are described and mentioned with technical information of the typical operating systems.
VIII. Justification
Eight justifications and rationales of the System's design are summarized in this section.

            In general, the Cyber-Healthcare System, a huge project implemented on a Hadoop backbone, provides personal health information, clinical health information, and public health information to help hospitals, outpatient clinics, insurance companies, healthcare purchasers, etc. provide high-quality, effective healthcare services at low cost to patients in the region.


REFERENCES

Apache Software Foundation (2014). What is apache hadoop?  Retrieved November 08, 2015 from http://hadoop.apache.org/

Borthakur, D. (2012). HDFS architecture. Retrieved August 08, 2016 from
https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#Streaming+Data+Access

CMS.gov (2005). System design document. Retrieved July 31, 2016 from
https://www.cms.gov/Research-Statistics-Data-and-Systems/CMS-Information-Technology/XLC/ 

Flower, J. (1999). The revolution in our assumptions about healthcare. Retrieved
September 08, 2016 from http://www.well.com/~bbear/assumptions.html

Hadoop.apache.org. (n.d.). Privacy policy. Retrieved September 8, 2016 from
http://hadoop.apache.org/privacy_policy.html

HIPAA Act (1996). The federal health insurance portability and accountability act.
Retrieved September 08, 2016 from http://tn.gov/health/topic/hipaa.

Hortonworks.com, (2016). Hortonworks sandbox. Retrieved September 07, 2016 from
www.hortonworks.com/products/sandbox.

IEEE (2009). 1016-2009  -  IEEE Standard for Information Technology--Systems
Design--Software Design Descriptions. Retrieved July 31, 2016 from http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=5167255&isnumber=5167254&url=http%3A%2F%2Fieeexplore.ieee.org%2Fstamp%2Fstamp.jsp%3Ftp%3D%26arnumber%3D5167255%26isnumber%3D5167254
Or http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5167255&isnumber=5167254

Intel, (2015). Extract, transform, and load big data with apache Hadoop. Retrieved September 7, 2016 from http://hadoop.intel.com and
https://software.intel.com/sites/default/files/article/402274/etl-big-data-with-hadoop.pdf

Natarajan, R. (2012). Apache Hadoop Fundamentals – HDFS and MapReduce Explained with a Diagram. Retrieved November 01, 2015 from http://www.thegeekstuff.com/2012/01/hadoop-hdfs-mapreduce-intro/

Nolan, J. (2015). Would hospitals benefit from ISO 9001? Retrieved September 08, 2016
from http://advisera.com/9001academy/blog/2015/07/21/would-hospitals-benefit-from-iso-9001/

Quora (2014). What is the future of business intelligence?  Retrieved October 20, 2015 from
http://www.quora.com/What-is-the-future-of-business-intelligence.

Schneiderman, B., Plaisant, C., & Hesse, B. (2013). Improving healthcare with
interactive visualization methods. Retrieved September 06, 2016 from https://www.cs.umd.edu/~ben/papers/Shneiderman2013Improving.pdf






Wednesday, September 21, 2016

Big Data Visualization Tools

Written by TSL
August 26, 2016
A. Introduction
With modern Internet technologies, automation, and various Web-scale, enterprise, cloud, and data computing in the competitive data-driven market and Internet-based economy, data at low storage cost and fast processing speed has exploded ubiquitously in both the public and private sectors (Gartner, 2016). Big Data, a generic term for data characterized by the 5 V's (massive Volume, Variety in different forms, high Velocity in processing, truthful Veracity, and Value), poses major challenges in capturing and extracting meaningful information for many organizations (Davis, 2016). Recently, many organizations have used data visualization and analytics to retrieve insights from their data assets for making sound decisions, increasing productivity, acquiring new customers, or gaining a competitive edge. Today, there are many visual analytical tools on the market to perform big data analytics, particularly data visualization for presentations. For example, Machlis (2011) provided twenty-two free data visualization tools for analytics; Jones (2014) listed ten leading data analytics tools in the business market, and Lurie (2014) addressed more than thirty-nine data visualization tools for cloud computing.
This document will present ten big data visualization tools that are available in the data-driven market. It includes a short descriptive summary, typical features, colorful snapshots, highlighted benefits or advantages, and some drawbacks or disadvantages. Ten visualization and analysis tools are R Project, Google Fusion Tables, Tableau Public, VIDI, Google Chart Tools, Splunk, Qlikview, KNIME, IBM Watson Analytics, and Microsoft Power BI. All of them are free for basic applications with limited memory space (e.g., less than 500 MB per day). Some vendors of these advanced products offer a free download, training, and support services, but the others may require users to use subscription services for large enterprise projects.

B. Big Data Visualization Tools  
Many big data visualization tools have grown in popularity over the past few years. Imanual (2015) of Predictive Analytics Today News noted that data visualization tools make significant impacts on organizational presentations of results. They play a crucial role in understanding data analytics outcomes. Ten of the most popular data visualization tools are addressed below:
     1. R project
          a. Summary
R Project is perhaps one of the most popular analytical tools in big data analytics and data visualization (Minelli, Chambers, & Dhiraj, 2013). It is an open-source programming language developed for ease of use. R is the first choice for statistical analysis, such as processing massive datasets and building data models with multi-purpose visualization capability. Its platforms include Linux, Mac OS X, and Windows XP and later. Its skill level is 4, for users who are experienced programmers.
          b. Benefits
Users use R to find hidden patterns, unknown correlations, and in-depth statistical relationships. R can be integrated with Apache Hadoop, MapReduce, or SQL Server, and it has strong capability in data visualization. Google uses R for statistics, data manipulation, and visualization in many services. Facebook uses R to create statistical reports to improve its news feed and services. The R language is also recommended for other fields such as healthcare, manufacturing, and marketing. R has become popular in R communities because it is free.
          c. Drawback
Note that both R and Apache Spark are popular big data analytics tools today. However, R has slower performance than Spark because it spills data to the hard disk drive while Spark computes in memory. Users of R are usually experienced or highly skilled programmers because R requires command-line programming.
Figure 1: R Studio GUI displays a large dataset of “extyags.nw” from one of the R library packages.
            Source: Adapted from CTU CS872 Unit 4 Individual Project, 2015.
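The correlation-hunting workflow described above can be sketched in a few lines. The example below uses Python purely for illustration (R's built-in cor() performs the same computation in a single call); the patient-visit figures are invented for the sketch and are not data from this project.

```python
import math

# Hypothetical patient-visit figures; the numbers and column meanings
# are illustrative assumptions, not data from the document.
visits   = [2, 5, 1, 8, 4, 7]
cost_usd = [120, 540, 80, 900, 400, 760]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r = pearson(visits, cost_usd)
print(f"Pearson r = {r:.2f}")  # a value near 1 signals a strong linear link
```

In practice an analyst would compute such coefficients across every pair of columns in a dataset and inspect the strongest ones, which is exactly the pattern-and-correlation discovery R is used for above.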

     2. Google Fusion Tables
          a. Summary
Google Fusion Tables is a Web-based data management service used to gather, visualize, and share data tables. Data are captured and stored in multiple tables for viewing or download. It provides dataset visualization and mapping. Its platform is a browser such as Chrome, Netscape, etc. (Halevy & Shapley, 2009).
          b. Benefits
            Data can be displayed visually in different forms such as bar charts, pie diagrams, line plots, timelines, scatter plots, or geographical maps. The data can be exported in comma-separated values (CSV) format. It has a skill level of 1, for users with some basic spreadsheet knowledge. Google Fusion Tables is free and easy to use.
          c. Drawback
It has limited customization and functionality when interacting with massive datasets. Files uploaded to Google servers are limited to 250 MB per user, and Google supports individual data sets of up to 100 MB.
            Figure 2: Google Fusion Tables displays the US map data.
            Source: Adapted from http://www.computerworld.com/article/2507728/enterprise-applications/enterprise-applications-22-free-tools-for-data-visualization-and-analysis.html?page=3#fusiontables, 2015.
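Because Fusion Tables exports plain comma-separated values, downloaded tables can be post-processed with everyday tooling. Below is a minimal sketch in Python; the table contents (a state/hospital count table) are invented for the example.

```python
import csv
import io

# A tiny stand-in for a CSV file exported from Google Fusion Tables;
# the table name and columns are hypothetical.
exported = io.StringIO(
    "State,Hospitals\n"
    "Arizona,14\n"
    "Colorado,11\n"
    "New Mexico,7\n"
    "Utah,9\n"
)

rows = list(csv.DictReader(exported))          # parse header + records
total = sum(int(row["Hospitals"]) for row in rows)
print(f"{len(rows)} states, {total} hospitals in total")
# prints "4 states, 41 hospitals in total"
```

For a real export, `io.StringIO` would be replaced by `open("table.csv")` on the downloaded file.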

     3. Tableau Public
          a. Summary
Tableau Public is an analytical tool for interactive data visualization, focused on business intelligence, that uses a database visualization language (VizQL). Users can use Tableau Public to query data in tables from relational and cloud databases or Excel spreadsheets, then generate graphs that are combined into dashboards and shared over the Internet or networks. Tableau's platforms include Windows and OS X. Its skill level is rated 3, for power users (Chabot, 2014).
          b. Benefits
Tableau is a powerful analytical tool in industry because it captures and extracts insights for data visualization presentation. Its software has won many awards for Best Overall Use in data visualization. With a limit of one million rows, Tableau Public provides a practical playground for individual use. Tableau's visual information gives users a great means to verify hypotheses in a timely manner, explore the data, and sanity-check results (Jones, 2014). Tableau Public and Tableau Reader are free, but related products such as Tableau Mobile and Tableau Desktop require subscription services.
          c. Drawback
Tableau Public and Tableau Reader are free, but other products such as Tableau Desktop, Server, and Online require a paid annual subscription. Tableau Public cannot create multiple dimensions in a custom group, and it is limited in defining new relationships when configuring new knowledge. At skill level 3, Tableau Public is best suited to power users or programmers.
            Figure 3: Screenshot of Tableau Public.

Source: Adapted from https://public.tableau.com/s/, 2016.

     4. VIDI
          a. Summary
VIDI includes a set of Drupal modules for creating visual data displays. Users can display changes in data values over time on geographical maps, or present static datasets in various types of charts, within the Drupal content management system. VIDI's platform is a browser, with a skill level of 1 (Dataviz.org, 2016).
          b. Benefits
VIDI can capture patterns and essential themes in huge data sets very rapidly through visual means. It offers many mapping options (more than Many Eyes) for files up to 5 MB, with colorful customization. A visualization wizard makes the tool easy to use.
          c. Drawback
Traditional tools for creating such visual representations are usually too expensive and challenging for smaller news organizations and everyday citizens to use, a gap VIDI aims to fill; as a drawback, however, the embed-code iframe may not display properly on the VIDI website.
            Figure 4: VIDI’s wizard displays a graphic on the HTML page.
            Source: Adapted from http://www.computerworld.com/article/2507728/enterprise-applications/enterprise-applications-22-free-tools-for-data-visualization-and-analysis.html?page=5#vidi, 2015.

     5. Google Chart Tools
          a. Summary
Google Chart Tools (GCT) provides a simple set of APIs for building custom interactive SVG charts, and it can visualize data hosted elsewhere. It supports organizational charts and geographic charts, and it provides an analytics dashboard for creating an analytics page with time-frame filters for chart visibility (Konforti, 2012). Its platform is a code editor and a browser, and it is rated at a skill level of 2 (Machlis, 2016).
          b. Benefits
The Google Chart Tools module includes a built-in library for visualization applications, and its service is highly rated. The comprehensive API set can take data in from a Google spreadsheet. Google Chart Tools is powerful, simple to use, and free.
          c. Drawback
Google Charts does not allow users to download the google.load or google.visualization code for offline use. Charts built with the Google Chart Tools module will not work on IE8 (Internet Explorer 8) because IE8 does not support SVG. The API also requires some coding, which pushes it toward being more of a programming tool.
            Figure 5: Google Chart Tools displays an analytics dashboard.
Source: Adapted from Konforti, 2016.

     6. Splunk
          a. Summary
Splunk is a data visualization tool used to search, monitor, and analyze big data generated by machines or sensors through Internet browsers or a Web-based interface. Splunk can capture, index, and correlate real-time data for charts, graphs, diagrams, reports, dashboards, and visual displays (Harris, 2010). Its platforms include Windows 7, 8, and 10, Windows Server, Linux, Solaris, and Mac OS X 10.9.
          b. Benefits
Splunk makes machine-generated data accessible across an organization by providing metrics, identifying data patterns, diagnosing problems, and providing intelligence for business operations. It can connect to any database source for analysis.
          c. Drawback
Splunk is built on indexing the logs generated by machines and sensors, but it was not designed with Business Intelligence objectives in view.
            Figure 6: Splunk displays a search on all machine data in real time.
            Source: Adapted from http://www.splunk.com/en_us/products/splunk-enterprise/features.html, 2016.
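Splunk's core idea, indexing machine-generated logs so they can be searched and correlated in real time, can be illustrated with a toy inverted index. The Python sketch below is a simplified stand-in for that idea, not Splunk's actual implementation or API, and the log lines are invented.

```python
from collections import defaultdict

# Hypothetical machine-generated log lines.
logs = [
    "2016-08-01 ERROR disk full on node-3",
    "2016-08-01 INFO backup completed",
    "2016-08-02 ERROR timeout on node-7",
]

# Build an inverted index: term -> set of line numbers containing it.
index = defaultdict(set)
for i, line in enumerate(logs):
    for term in line.lower().split():
        index[term].add(i)

def search(*terms):
    """Return log lines containing every query term (AND search)."""
    hits = set.intersection(*(index.get(t.lower(), set()) for t in terms))
    return [logs[i] for i in sorted(hits)]

print(search("error"))            # both ERROR lines
print(search("error", "node-3"))  # only the node-3 failure
```

Splunk layers real-time ingestion, time-based retrieval, and dashboards on top of essentially this index-then-search pattern.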

     7. Qlikview
          a. Summary
Qlikview offers simple drag-and-drop, self-service creation of data visualizations without writing many SQL query commands. Qlikview can connect various databases from different vendors into Qlikview's centralized repository. It uses an intelligent indexing method to discover patterns and trends across different data types, and it provides dashboards that aid decision support systems. Its platform is 64-bit Windows, with a skill level of 2 (Qlik, 2015).
          b. Benefits
Qlikview accepts dynamic data type formats from any source into its in-memory analytics platform. It has many documentation channels and can build big data applications quickly, without disruption or downtime.
          c. Drawback
A schemaless or dynamic schema cannot be used to connect to data sources in Qlikview. Its graphics are not as powerful as Tableau's. Qlikview is not a free product, although Qlik Sense is free.
            Figure 7: Qlikview displays a sample analysis.
            Source: Adapted from http://www.computerworld.com/article/2920545/business-intelligence/qlik-sense-free-dataviz-app-adds-public-private-sharing.html, 2015.

     8. KNIME
          a. Summary
    KNIME is an open source platform for data integration, analytics, and reporting. It allows users to program visually, analyzing, manipulating, and modeling data in an intuitive way. It uses machine learning and data mining to integrate various components through a modular data pipeline concept. Users can drop nodes onto a canvas and drag connection points between activities (Abhishek & Arvind, 2007).
          b. Benefits
KNIME is a powerful analytics tool with a vast set of native nodes for integration and visualization and an easy-to-learn graphical interface. It is scalable and reliable within the infrastructure, and it can run R, Python, text mining, chemistry data, etc., for more advanced code-driven analysis. KNIME is free and easy to use.
          c. Drawback
KNIME's main disadvantage is that preliminary results are not available while the pipeline is running; for example, single rows cannot be sent on and processed right after they are created (Meinl, Cebron, & Gabriel, 2009).
            Figure 8: KNIME displays an analytics process and Platform GUI.
            Source: Adapted from https://www.knime.org/knime-analytics-platform, 2016.
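KNIME's modular data-pipeline concept, in which each node transforms the output of the one before it, can be sketched as chained functions. The Python sketch below is only an analogy for the canvas-and-nodes model; the node names and sample records are invented, not KNIME APIs.

```python
def read_node(raw):
    """Reader node: parse 'name:value' records into tuples."""
    return [tuple(record.split(":")) for record in raw]

def filter_node(rows, threshold):
    """Filter node: keep rows whose value meets the threshold."""
    return [(name, int(value)) for name, value in rows if int(value) >= threshold]

def report_node(rows):
    """Reporting node: summarize the surviving rows."""
    return {name: value for name, value in rows}

# Wire the nodes into a pipeline, as one would on a KNIME canvas.
raw = ["alice:42", "bob:7", "carol:99"]
result = report_node(filter_node(read_node(raw), threshold=10))
print(result)  # prints {'alice': 42, 'carol': 99}
```

The drawback quoted above corresponds to the fact that, in this model, each node consumes its whole input before the next node starts, so no row-by-row streaming of preliminary results occurs.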

     9. IBM Watson Analytics
          a. Summary
IBM Watson is a question-answering computing system for machine learning, information retrieval, knowledge representation, and automated reasoning. It can find the correct answer by running a hundred proven language-analysis algorithms. IBM Watson's applications are often used in financial services, telecommunications, healthcare, government, and game contests such as Jeopardy! (Thomson, 2010). Users are not required to know statistics because IBM Watson computes everything in the background. IBM Watson also provides browser-based visualization and analysis applications, with a skill level of 1.
          b. Benefits
IBM Watson is an analytics tool that can retrieve key information from all kinds of documents and surface hidden patterns, insights, and correlations across huge data sets. About 80% of data is unstructured, in forms such as news articles, online posts, research papers, or organizational system data (Thomson, 2010). It is a free tool.
          c. Drawback
IBM Watson can be slow to understand the context of clues. In healthcare, IBM Watson helps identify treatment options for patients, but it has not yet taken part in the process of medical diagnosis. It is still in a beta stage, with more upgraded versions to come.
            Figure 9: IBM Watson Architecture in a deep question answering computing.
Source: Adapted from https://www.ibm.com/analytics/watson-analytics/us-en/, 2016.

Figure 10: IBM Watson highlights high diamond prices.


Source: Adapted from https://www.ibm.com/analytics/watson-analytics/us-en/, 2016.

     10. Microsoft Power BI (Business Intelligence)
          a. Summary
Microsoft Power BI is a business analytics tool for analyzing data and providing insightful information, similar to Excel's Power Query. It can be used to monitor a business and share timely answers on convenient dashboards. With drag-and-drop features, Power BI uses natural language to report data in a visual format or to find good answers to difficult questions (Power BI, 2016).
          b. Benefits
The Microsoft Power BI platform provides a better price-performance ratio for data visualization (DV), with a robust set of BI and DV modules such as SQL Server, SharePoint, Analytical Reporting and Integration Services, and Excel 2010 with the PowerPivot add-in.
          c. Drawback
Power BI's implementation is about average, with good scalability and good data integration, but its dashboard support is weak.
            Figure 11: Microsoft Power BI displays airlines’ departure and arrival delays.
            Source: Adapted from http://www.computerworld.com/article/3088958/data-analytics/free-data-visualization-with-microsoft-power-bi-your-step-by-step-guide-with-video.html, 2016.

E. Summary
This document presented a descriptive evaluation of ten advanced big data visualization tools used to answer difficult questions, extract meaningful information, and reveal insights, hidden patterns, or correlations across data from various sources in the data-driven market. They included R Project, Google Fusion Tables, Tableau Public, VIDI, Google Chart Tools, Splunk, Qlikview, KNIME, IBM Watson Analytics, and Microsoft Power BI. For each analytical tool, a descriptive summary, typical features, graphical snapshots, benefits, and drawbacks were described, drawing on a variety of scholarly resources and credible websites.
In summary, this document described ten advanced data visualization tools for presenting big data visualization results and outcomes from robust analyses in today's data-explosion market.


REFERENCES

Abhishek, T., & Arvind, S. (2007). Workflow based framework for life science informatics. Computational Biology and Chemistry, 31(5-6), 305–319.

Chabot, C. (2014). How to get a 20 million dollar pre-money for series a:tableau software. Retrieved August 22, 2016 from
http://www.sramanamitra.com/2010/03/05/how-to-get-a-20-million-pre-money-valuation-for-series-a-tableau-software-ceo-christian-chabot-part-3/

Dataviz.org, (2016). How it works. Retrieved August 23, 2016 from http://www.dataviz.org/how-it-works

Davis, J. (2016). 2016 Gartner Magic Quadrant for Business Intelligence and Analytics Platforms. Retrieved August 21, 2016 from
http://www.informationweek.com/big-data/software-platforms/gartner-bi-magic-quadrant-inflection-point-has-arrived/d/d-id/1324233

Gartner Group (2016). Gartner BI magic quadrant: inflection point has arrived. Retrieved June 4, 2015 from
http://info.birst.com/AR-Gartner2016CriticalCapabilities_LP.html

Halevy, A., & Shapley, R. (2009). Google fusion tables. Retrieved August 22, 2016 from
https://research.googleblog.com/2009/06/google-fusion-tables.html

Harris, D. (2010). How splunk is riding it search toward an ipo. Retrieved August 23, 2016 from https://gigaom.com/2010/12/17/how-splunk-is-riding-it-search-toward-an-ipo/

Imanuel (2015). 50 big data platforms and big data analytics software. Predictive Analytics Today. Retrieved November 16, 2015 from http://www.predictiveanalyticstoday.com/bigdata-platforms-bigdata-analytics-software/

Jones, A. (2014). Top 10 data analysis tools for business. Retrieved August 21, 2016 from
http://www.kdnuggets.com/2014/06/top-10-data-analysis-tools-business.html

Konforti, R. (2012). Google chart tools. Retrieved August 2016 from
https://www.drupal.org/project/google_chart_tools.

Lurie, A. (2014). 39 Data Visualization Tools for Big Data | ProfitBricks Blog. Retrieved August 21, 2016 from https://blog.profitbricks.com/39-data-visualization-tools-for-big-data/

Machlis, S. (2011). 22 free tools for data visualization and analysis. ComputerWorld. Retrieved August 8, 2016 from http://www.computerworld.com/article/2507728/enterprise-applications/enterprise-applications-22-free-tools-for-data-visualization-and-analysis.html

Meinl, T., Cebron, N., & Gabriel, T. (2009). The konstanz information miner 2.0. Retrieved August 23, 2016 from https://kops.uni-konstanz.de/bitstream/handle/123456789/5762/main.pdf;sequence=1

Microsoft Power BI (2016). Bring your data to life. Retrieved August 23, 2016 from https://powerbi.microsoft.com/en-us/?WT.srch=1&WT.mc_id=AID529580_SEM_uDaUULKn&utm_source=Google&utm_medium=CPC&utm_term=microsoft%20power%20bi&utm_campaign=Power_BI&gclid=Cj0KEQjw6O-9BRDjhYXH2bOb8Z4BEiQAWRduk_e-USXb3hqcbKLjs43WZuqXyMhACPamjd1J7Nwju6UaAkL_8P8HAQ

Minelli, M., Chambers, M., & Dhiraj, A. (2013). Big data, big analytics: emerging business intelligence and analytic trends for today's businesses. John Wiley & Sons.

Qlik (2015). Make stunning data discoveries. Retrieved November 2, 2015 from http://www.qlik.com/products/qlik-sense

Thomson, C. (2010). What is i.b.m.’s watson? Retrieved August 23, 2016 from
http://www.nytimes.com/2010/06/20/magazine/20Computer-t.html?_r=0