Monday, March 27, 2017

A Survey Report on Big Data Security Analytics Tools

Introduction
In a data-driven era, big data is ubiquitous and has become a uniquely valuable asset for organizations because of the meaningful information it contains. At the same time, its complexity, massive size, and storage requirements challenge the many organizations that set up analytical processes to distill big data into knowledge or wisdom (Sakr & Gaber, 2014). Cloud computing, with service models such as PaaS, IaaS, and SaaS, has recently emerged as a computing model for processing big data thanks to its flexibility, availability, and affordability. However, clouds raise major security issues concerning the confidentiality, integrity, availability, and privacy of data and of the applications outsourced to them.
Cloud computing systems, like other distributed computing systems, face complex, large-scale security attacks such as eavesdropping, masquerading, message tampering, message replay, and denial of service (Khan, Yaqoob, Hashem, Inayat, Ali, Alam, Shiraz & Gani, 2014). Existing tools such as ManageEngine, BlackStratus, and SolarWinds from the first generation of security information and event management (SIEM) technologies share common functionality with earlier tools while adding individual capabilities. Nevertheless, they are unable to detect and resolve complex advanced persistent threats (APTs) (Gartner, 2016). There is also a lack of a good survey of the existing scalable and intelligent security analytics tools covering their new technical functions and their suitability for targeted security problems.
This Unit 5 Individual Project document surveys four scalable and intelligent big data security analytics tools: it describes their main functions for APT detection in the cloud computing environment, examines their key design considerations for scalability, and discusses the pros and cons of each tool.

Scalable and intelligent security analytics tools

The urgent need for early detection, prevention, or at least mitigation of APTs and other complex cyberattacks on organizational databases, particularly in cloud computing systems, is driving a new generation of advanced big data security analytics tools. Gartner's report "Magic Quadrant for Security Information and Event Management," published in August 2016, placed 14 big data security analytics tools in the SIEM magic quadrant shown in Figure 1, where the lowest-performing players are ManageEngine, BlackStratus, SolarWinds, etc., and the best-performing leaders are IBM QRadar, Splunk, LogRhythm, etc.
Source: Adapted from Gartner’s SIEM Report, 2016.
    
Four of the most advanced scalable and intelligent security analytics tools chosen for this survey report are IBM QRadar, Splunk, LogRhythm, and HPE ArcSight.

Main functions
SIEM technologies consume event data such as log data, NetFlow, and network packets. In the SIEM market, event data also includes contextual information about users, assets, threats, and vulnerabilities. Organizations in both the public and private sectors need to analyze event data in real time for early detection of data breaches and targeted attacks, and this need drives the security information and event management market. SIEM activities include capturing, collecting, storing, investigating, and reporting on log data for forensics, incident response, and regulatory compliance. The main functions of the top advanced big data security analytics tools are described as follows:
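As a toy illustration of the kind of correlation rule these SIEM platforms run over collected log data, the sketch below flags source IPs with repeated failed logins. The event schema, field names, and threshold are invented for the example and do not reflect any vendor's implementation.

```python
from collections import defaultdict

def detect_brute_force(events, threshold=5):
    """Flag source IPs with too many failed logins (toy SIEM-style correlation rule).

    `events` is a list of dicts with hypothetical 'src_ip' and 'outcome' keys,
    a simplified stand-in for normalized log events.
    """
    failures = defaultdict(int)
    for event in events:
        if event["outcome"] == "failure":
            failures[event["src_ip"]] += 1
    return [ip for ip, count in failures.items() if count >= threshold]

events = (
    [{"src_ip": "10.0.0.5", "outcome": "failure"}] * 6
    + [{"src_ip": "10.0.0.9", "outcome": "success"}]
)
print(detect_brute_force(events))  # ['10.0.0.5']
```

Real SIEM rules operate on normalized events from many sources and correlate across time windows; this sketch shows only the basic count-and-threshold idea.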
     IBM’s QRadar’s main functions
            QRadar is an IBM’s security intelligence platform that can be implemented in both physical and virtual appliances, and infrastructure as a service (IaaS) in the cloud, and QRadar itself as a service fully managed by IBM Managed Security Services team. Qradar can collect and process log data, security event, NetFlow, and monitor network traffic by using deep-packet inspection and full-packet capture.  It can perform behavior analysis for all supported data sources (Gartner, 2016). QRadar collects log data from an enterprise, OS (operating systems), host assets, applications, user activities, and performs real-time analysis for malicious activity to stop, prevent or mitigate the damage to the company (Scarfone, 2017).
     Splunk’s main functions
Splunk is a security intelligence platform comprising Splunk Enterprise, which provides log and event collection, search, and visualization in the Splunk query language, and Splunk Enterprise Security (ES), which adds security-specific SIEM features. Splunk Enterprise (SE) enables data analysis for IT operations, business intelligence, and application performance management, and it can be combined with ES for security event monitoring and analysis. Splunk ES supports predefined dashboards, search, correlation rules, visualizations, and reports for real-time security monitoring and incident response. Splunk ES and SE can be deployed in public or private clouds, or as a hybrid, for software as a service (SaaS) (Gartner, 2016). Splunk has the capability to identify potential threats quickly and to trigger a human or automated response to stop attacks before they are completed.
     LogRhythm’s main functions
            LogRhythm’s SIEM platform can be used as an appliance with software as virtual instance format to support an n-tier scalable decentralized architecture. LogRhythm provides endpoint and network forensic capabilities of a system process, file integrity, monitoring NetFlow, full-packet capture. It also combines events, endpoint, and network monitoring for the integrated incident response, automated response capabilities. Platform Manager, AI Engine, Data Processors, Data Indexers and Data Collectors can be a consolidated all-in-one for high performance in APTs detection.
     HPE ArcSight’s main functions
The HPE ArcSight SIEM platform consists of (1) the ArcSight Data Platform (ADP) for log collection, management, and reporting; (2) ArcSight Enterprise Security Management (ESM) software for large-scale security monitoring deployments; and (3) ArcSight Express, an appliance-based all-in-one offering designed for preconfigured monitoring and reporting as well as simplified data management. The ArcSight Connectors, Management Center, and Logger in the ArcSight Data Platform provide log management, data collection, and related functions. ArcSight modules can perform entity behavior analytics, malware detection, and threat intelligence.

Tool assessment
Cloud computing is the ability to control computation dynamically at an affordable cost while letting users utilize the computation without managing the underlying complexity of the technology (Open Cloud Manifesto, 2012). With special system features such as virtual machines, trust asymmetry, and a semitransparent system architecture, cloud computing systems face special security issues. Sakr and Gaber (2014) recognized at least ten (10) such issues, listed below:
1. Exploitation of Co-tenancy
2. Secure Architecture for the Cloud
3. Accountability for Outsourced Data
4. Confidentiality of Data and Computation
5. Privacy
6. Verifying Outsourced Computation
7. Verifying Capability
8. Cloud Forensics
9. Misuse Detection
10. Resource Accounting and Economic Attacks
Are the four advanced big data security analytics tools, IBM QRadar, Splunk, LogRhythm, and HPE ArcSight, suitable to address these security issues in the cloud computing environment?
     IBM QRadar’s assessment
IBM QRadar includes QRadar Log Manager, Data Node, SIEM, Risk Manager, Vulnerability Manager, QFlow and VFlow Collectors, and Incident Forensics. These features cover several security issues, such as monitoring security events, log information for anomaly detection, cloud forensics, misuse detection, and confidentiality of data and computation. IBM X-Force Exchange is used to share threat intelligence, and the QRadar Application Framework supports the IBM Security App Exchange. For incident response, QRadar uses Resilient Systems, and it can monitor individual security events and responses. QRadar covers issues 1, 2, 3, 4, 6, 8, 9, and 10 in the list above. Thus, IBM QRadar is suitable for use in the cloud computing environment.
     Splunk’s assessment
            Splunk’s architecture comprises streaming input, Forwarders that ingest data, Indexers that index and store raw machine logs, and Search Heads that provide data access via the Web-based GUI. Splunk provides incident management, workflow capabilities, improved visualization, and monitoring IaaS, and SaaS providers. It focuses on security event monitoring, analysis use cases and threat intelligence. Splunk covers cloud security issues (The list numbers: 2, 4, 5, 7, 8, and 10). Therefore, Splunk is a good security analytics tool in cloud computing.
     LogRhythm’s assessment
LogRhythm provides log processing, indexing, and expanded unstructured search with clustered full data replication. It has improved its risk-based prioritization scoring algorithm and the protocols for Network Monitor. It supports cloud services such as AWS, Box, and Okta, and integrates with cloud access security broker solutions like Microsoft Cloud App Security and Zscaler. LogRhythm covers cloud security issues 2, 3, 5, 6, 8, and 10 in the list above, and it is also a popular security analytics tool for cloud computing.
     HPE ArcSight’s assessment
HPE ArcSight is a SIEM platform for midsize organizations, enterprises, and service providers. It provides DNS malware detection and threat intelligence that extend the SIEM's capabilities. Its SIEM architecture and licensing model use an interface to control incoming events, incidents, and traffic analysis. It supports midsize SIEM deployments with extensive third-party connector support. HPE ArcSight covers cloud security issues 2, 4, 6, 7, 8, and 9 in the list above, and it is a good fit for large-scale deployments such as the cloud computing environment.

Targeted applications
To cope with sophisticated attacks, the big data analytics approach applies advanced analytical methods to the huge data sets collected from various systems. This approach answers the question "why will big data analytics help solve hard-to-solve security problems?" for at least three reasons. First, the attacker has no way to know exactly what is stored in big data storage such as cloud repositories, so the attacker is unsure how to attack the data sets. Second, the analytical methods may change over time, which prevents the attacker from eluding them. Third, some forensic marks of the attacks cannot be removed from the system; these forensic traces become evidence for the security analytics tools to detect (Elovici, 2014). Therefore, advanced big data security analytics tools can be applied in most fields, for example, Web applications, financial applications, insider threat analysis, full-spectrum fraud detection, and Internet-scale botnet discovery.
The advanced big data security analytics tools should address the problems of security and analysis. Figure 2 illustrates the non-linear relationship between effectiveness and cost of the security analytics tools.
Figure 2:
Source: Adapted from Dr. Cross’s live chat lecture, 2017.

     IBM QRadar’s applications
IBM's QRadar Security Intelligence Platform, with its physical and virtual appliance and IaaS deployment options, supports multitenant capabilities in applications such as cloud computing systems, Web applications, healthcare fraud detection, and health monitoring and patch management with a health check framework (ScienceSoft, 2016). QRadar is an advanced security intelligence platform reported to detect about 80% to 90% of APTs.
     Splunk’s applications
Splunk's security intelligence platform with specific SIEM features provides accurate data analysis on premises and in hybrid (public and private) clouds for Web applications, insider threat analysis, data stream processing engines, and monitoring of IaaS and SaaS providers. Splunk's SIEM platform offers flexibility across data sources along with analytics capabilities from machine learning and UEBA functionality. Similar to IBM QRadar, Splunk's effectiveness is also in the 80% to 90% range.
     LogRhythm’s applications
LogRhythm is a SIEM solution for midsize and large enterprises with applications in forensics, file integrity, and integrated incident response. LogRhythm is well suited to Web applications, insider threat analysis, routine security automation, and APT detection, and it supports AWS, Box, Okta, and cloud access security broker solutions. Its effectiveness is likely in the same range as that of QRadar and Splunk.
     HPE ArcSight’s applications
The HPE ArcSight SIEM platform is used by midsize organizations and service providers, and it can be used in a Security Operations Center (SOC). Its applications include Web applications, insider threats, Internet-scale botnet discovery, user behavior analysis, out-of-the-box third-party technology connectors and integrations, financial services, and DNS malware analytics.
Primary design for scalability
Scalability is a dominant subject in distributed computing, particularly cloud computing. A distributed system is scalable if it remains effective when the number of users, the volume of data, or the resources increase significantly. For example, a cloud system can expand its hardware by adding more commodity computers. There are two ways to scale: scale up or scale out. Most cloud computing systems prefer to scale out because they can then accommodate more data, users, hardware, software, and services; improve system bandwidth and performance; and reduce bottlenecks, delay, and latency. Two factors limit scalability: (1) parts of a program, such as initialization, that can never be parallelized, and (2) a likely high load imbalance among tasks. The key design considerations for scalability of the chosen advanced big data security analytics tools are discussed below:
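The first limiting factor, the portion of work that can never be parallelized, is captured by Amdahl's law. The short calculation below is a generic illustration (not tied to any of the surveyed tools) of why scale-out hits diminishing returns:

```python
def amdahl_speedup(serial_fraction, nodes):
    """Upper bound on speedup when only part of a job parallelizes (Amdahl's law)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / nodes)

# With 10% serial work (e.g., initialization), adding nodes quickly hits a ceiling:
for n in (2, 8, 64):
    print(n, round(amdahl_speedup(0.10, n), 2))  # 2 1.82 / 8 4.71 / 64 8.77
```

Even with 64 nodes, a job that is 10% serial speeds up by less than 9x, which is why SIEM vendors also work to reduce the serial coordination steps in their pipelines.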
     IBM QRadar’s scalability
            The IBM QRadar’s architecture includes event processors for processing event data and event collectors for capturing and forwarding data. It has deployment options range from all-in-one appliance implementations or scaled appliance implementations using separate appliances for discrete functions for scalability. Integrated appliance version 3105 provides up to 5000 events per second, 200,000 flows per minutes, 6.2 terabytes storage. Integrated appliance 3128 provides up to 15,000 events per second, 300,000 flows per minutes, 40 terabytes storage (Scarfone, 2017). QRadar can collect log events and network flow data from cloud-based applications
     Splunk’s scalability
Splunk applies a modern big data platform that enables users to scale and to solve a wide range of use cases for security operations and compliance. It provides flexible deployment options, such as on premises, in the cloud, or in hybrid environments, depending on the workloads and use cases. Splunk UBA, however, has fairly limited scalability because it requires a separate infrastructure with a license model different from Splunk Enterprise and Enterprise Security (Scarfone, 2015).
     LogRhythm’s scalability
LogRhythm helps customers detect and respond quickly to cyber threats before a material breach occurs. It provides compliance automation and IT predictive intelligence. Its architecture is decentralized and scalable into n tiers. The components of LogRhythm's architecture, the Platform Manager, AI Engine, Data Processors, Data Indexers, and Data Collectors, can scale into more tiers for heavier workloads (Symtrex, 2016). This scalability allows access to the data for analytics and reporting in an iterative approach to incident investigation.
     HPE ArcSight’s scalability
HPE ArcSight focuses on big data security analytics and intelligence software for SIEM and log management solutions. It is designed to assist customers in identifying and prioritizing security threats, simplifying audit and compliance activities, and organizing incident response activities (Morgan, 2010). With extensive third-party connector support for midsize SIEM deployments, HPE ArcSight is a good fit for organizations building a dedicated SOC.

Pros and cons
The most important characteristics of SIEM products are the ability to access and combine data and information from multiple sources, and the ability to perform intelligent queries on that data and information (IT Central Station, 2016). According to Gartner (2016), the advanced big data security analytics tools QRadar, Splunk, LogRhythm, and ArcSight have the following pros and cons in security information and event management.
     IBM QRadar
For the pros, QRadar offers a visual view of log and event data, network flows and packets, asset and vulnerability data, and threat intelligence. It can analyze network traffic behavior for correlation through NetFlow and log events. Its modular architecture supports security event monitoring and logging in IaaS environments, including AWS CloudTrail and SoftLayer. QRadar can be deployed and maintained easily as an all-in-one appliance or in a large tiered or multisite environment. The IBM Security App Exchange can integrate third-party capabilities into the SIEM dashboards and the investigation and response workflow. Users find the dashboards most helpful for an overview of traffic flow and issues, and the built-in rules and reports are comprehensive and effective (Gartner, 2016).
For the cons, QRadar can monitor endpoints for threat detection and response or basic file integrity, but it needs third-party technologies to do so. The integration of the IBM vulnerability management add-on has had mixed success among users. The IBM sales engagement process is complex and requires persistence, and the multiple Java versions needed for deployment setup are inconvenient (IT Central Station, 2016).
     Splunk
            For the pros, Splunk’s monitoring use cases in security drives significant visibility for users. Splunk UBA with more advanced methods provides advanced security analytics capabilities in native machine learning functionality and integration. It includes the essential features to monitor threat detection and inside threat use cases. In IT operations, monitoring solutions provides security teams with in-house experience on existing infrastructure and data. the application of log’s events for business needs is helpful. Operational intelligence is fast and available across several servers.
For the cons, Splunk Enterprise Security supports only basic predefined correlations for user monitoring and reporting requirements, while other vendors provide richer content for these use cases. Splunk licenses by gigabytes of data volume indexed per day, so the solution costs more than other SIEM products where high data volumes are expected; sufficient planning and prioritization of data sources is recommended to avoid overconsuming the licensed data volume. Splunk UBA requires a separate infrastructure and a license model different from the Splunk Enterprise and Enterprise Security models. Setting up and adding new data sources should be improved, and the operational workflow and ticketing systems should be made more suitable for a security operations center.
     LogRhythm
For the pros, LogRhythm combines SIEM capabilities, endpoint monitoring, network forensics, UEBA, and incident management to support security operations and advanced threat monitoring use cases. LogRhythm provides dynamic context integration and security monitoring workflows with a highly interactive user experience. Its emerging automated response capabilities can execute actions on remote devices. LogRhythm's solution is easy to deploy, maintains effective out-of-the-box use cases and workflows, and remains very visible in the competitive SIEM market. LogRhythm creates a good feedback loop that lets users see off-limits activities; the AI Engine is the most valuable feature, and the out-of-the-box experience makes it easy and intuitive to get started.
For the cons, even though LogRhythm offers integrated security capabilities such as System Monitor and Network Monitor that enable synergies across IT through deeper integration with the SIEM, users with critical IT and network operations should evaluate them against related point solutions. Its custom report engine requires improvement. LogRhythm has fewer sales and channel resources than other leading SIEM vendors, which may mean fewer choices for resellers. Finally, a client must be installed on the computer for all of the functions to work.
     HPE ArcSight
For the pros, HPE ArcSight supplies a complete set of SIEM capabilities to support a large-scale SOC. Its User Behavior Analytics provides full UBA capabilities along with the SIEM. It has numerous out-of-the-box third-party technology connectors and integrations. HPE ArcSight reduces investigation time and, as a result, is among the best at event ingestion. Its system components, such as the connectors, logger, and correlation engines, are very stable.
For the cons, HPE ArcSight proposals lean more heavily on professional services than comparable offerings. Its ESM is complex and expensive to deploy, configure, and operate compared with IBM QRadar, Splunk, and others; deployment requires expertise in configuring the ArcSight software for custom applications, and the correlation rules should be simplified. Although ArcSight ranks among the top four advanced big data security analytics tools, it shows decreasing visibility in new installs and increasing competitive replacements. HPE has spun up a development effort to redo the core technology platform, so users should weigh these development plans carefully against their own needs.

Conclusion
This Unit 5 Individual Project presented the concept of SIEM technologies in advanced big data security analytics to detect, prevent, and mitigate advanced persistent threats and data breaches from increasingly sophisticated attackers. It makes sense to integrate security data and improve incident detection, response, and security operations first, and then move on to integrating the myriad security management consoles, middleware, and enforcement points (Oltsik, 2014).

The leading security analytics tools, i.e., IBM QRadar, Splunk, LogRhythm, and HPE ArcSight, were reviewed in Gartner's Magic Quadrant (2016). The primary functions of each advanced tool, such as anomaly detection, event correlation, and real-time analytics, were discussed. The tools were assessed for their suitability for use in the cloud computing environment, the effective tools for the targeted applications and security problems were studied, and the key design considerations for scalability were examined for each tool. Finally, the strengths and weaknesses of each advanced big data security analytics tool were briefly compared in this document.

REFERENCES

Elovici, Y. (2014). Detecting cyberattacks using big data security analytics. Retrieved March 01, 2017 from https://www.youtube.com/watch?v=bP_1X-392pU  
Gartner (2016). Gartner's complete analysis in the SIEM 2016 magic quadrant. Retrieved March 7, 2017 from https://logrhythm.com/2016-gartner-magic-quadrant-siem-report/?utm_source=google&utm_medium=cpc&utm_campaign=SIEM-Search-US&AdGroup=SIEM-General&utm_region=NA&utm_language=en.
IT Central Station (2016). Compare IBM Security QRadar, Splunk, HPE ArcSight, and LogRhythm. Retrieved March 13, 2017 from http://www.csoonline.com/article/3067716/network-security/siem-review-splunk-arcsight-logrhythm-and-qradar.html?upd=1489463430396
Khan, N., Yaqoob, I., Hashem, I. A. T., Inayat, Z., Ali, W. K. M., Alam, M., Shiraz, M., & Gani, A. (2014). Big data: Survey, technologies, opportunities, and challenges. The Scientific World Journal, 2014. Retrieved from http://www.hindawi.com/journals/tswj/2014/712826/
Morgan, T. (2010). "HP eyes $1.46bn ArcSight security buy: Hey, Dell. Wanna bid higher?". The Register.
Oltsik, J. (2014). Big data security analytics can become the nexus of information security integration. NetworkWorld. Retrieved from http://www.networkworld.com/article/2361840/security0/big-data-security-analytics-can-become-the-nexus-of-information-security-integration.html
Open Cloud Manifesto (2012). Open cloud manifesto google group. Retrieved March 12, 2017 from https://groups.google.com/forum/#!forum/opencloud.
Symtrex (2016). LogRhythm. Retrieved March 13, 2017 from http://www.symtrex.com/security-solutions/logrhythm/?gclid=Cj0KEQjwhpnGBRDKpY-My9rdutABEiQAWNcslI_0E0TnnNdgbPw24A3vq_b_YZ35qRPH2WBYbZg3TugaAhfY8P8HAQ
Sakr, S., & Gaber, M. (Eds.). (2014). Large scale and big data: processing and management. Boca Raton, FL: CRC Press.
Scarfone, K. (2017). IBM Security QRadar: SIEM product overview. Retrieved March 13, 2017 from http://searchsecurity.techtarget.com/feature/IBM-Security-QRadar-SIEM-product-overview
Scarfone, K. (2015). Splunk Enterprise: SIEM product overview. Retrieved March 13, 2017 from http://searchsecurity.techtarget.com/feature/Splunk-Enterprise-SIEM-product-overview.

ScienceSoft (2016). Health check framework. Retrieved March 12, 2017 from https://www.scnsoft.com/services/security-intelligence-services/health-check-framework-for-ibm-qradar-siem.

Friday, March 24, 2017

Large-Scale Data Stream Processing Engines

Introduction
In the Big Data era, increasing connectivity between users and machines has generated a massive flow of data in the cloud. The widespread adoption of multicore processors and the emergence of cloud computing led to concurrent computation frameworks such as MapReduce, valued for simplicity, scalability, and fault tolerance in batch processing tasks. However, mobile devices, the proliferation of sensors, location services, and real-time network monitoring have recently created a critical demand for scalable, parallel frameworks that process huge amounts of streamed data. Centralized stream processing has evolved into decentralized stream engines that distribute queries among the machines of a cluster to achieve scalable streaming data processing in real time (Sakr & Gaber, 2014). Some stream processing engines are Aurora, Borealis, IBM System S and IBM SPADE, DEDUCE, Flux, Twitter Storm, StreamCloud, and Stormy.
This Unit 4 Individual Project document selects the last two of these large-scale data stream processing engines, StreamCloud and Stormy, for a discussion that specifies their primary features, compares their architecture and scalability, identifies performance bottlenecks, and analyzes root causes and solutions.

Large-Scale Data Streaming Processing Systems
Various stream processing applications, such as mobile apps, sensor networks, Web TV, and real-time network monitoring, require real-time processing of data streams, for example in financial data analysis, sensor network data, or military command and control. A data stream is an infinite, append-only sequence of tuples, analogous to the rows and columns of a table. Queries are defined over one or more data streams, and each query consists of a network of operators that include stateless operations (e.g., filter, map, union) and stateful operations (e.g., join, or aggregation over a sliding window). Traditional data storage and processing engines cannot meet the massive-volume and low-latency requirements, and centralized stream processing engines (SPEs) are limited for workloads such as network monitoring or fraud detection. SPEs can be categorized into distributed SPEs, which distribute queries to individual nodes, and parallel SPEs, which run the same queries on different nodes in parallel.
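The distinction between stateless and stateful operators can be sketched in a few lines. The example below is a simplified illustration of the operator types named above, not code from either engine: the filter handles each tuple independently, while the sliding-window aggregate must keep state across tuples.

```python
from collections import deque

def filter_op(stream, predicate):
    """Stateless operator: each tuple is processed independently."""
    for tup in stream:
        if predicate(tup):
            yield tup

def sliding_window_avg(stream, size):
    """Stateful operator: aggregates over a sliding window of the last `size` tuples."""
    window = deque(maxlen=size)  # old tuples are evicted automatically
    for value in stream:
        window.append(value)
        yield sum(window) / len(window)

readings = [10, 20, 30, 40]
print(list(filter_op(readings, lambda v: v > 15)))  # [20, 30, 40]
print(list(sliding_window_avg(readings, size=2)))   # [10.0, 15.0, 25.0, 35.0]
```

Stateful operators are what make parallelization hard: the window state must follow the tuples when a query is split across nodes, which is exactly the problem StreamCloud's subquery partitioning addresses.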
StreamCloud is a distributed data stream processing system that provides transparent query parallelization in the cloud. Its main features are scalability, elasticity, transparency, and portability. StreamCloud can scale with the volume of the data stream. Its elasticity allows the system to adjust on the fly with low intrusiveness. Its transparency allows queries to be parallelized without user intervention. For portability, StreamCloud is independent of the underlying SPE (Gulisano, 2010).
Stormy is a distributed multi-tenant streaming service that runs on cloud infrastructure. Built on a synthesis of cloud computing techniques such as Amazon Dynamo and Apache Cassandra, Stormy achieves multi-tenancy, scalability, availability, elasticity, fault tolerance, and, in particular, streaming as a service (StaaS). For multi-tenancy, Stormy can accommodate many users with different workloads. For scalability, Stormy uses distributed hash tables (DHTs) to distribute queries across all machines for efficient data distribution. For high availability, it replicates queries on several nodes. Stormy's architecture is decentralized, so there is no single point of failure. For elasticity, underused nodes can be shut down to optimize operating cost. It also supports fault tolerance. Stormy is designed to host small and medium-sized streaming applications for StaaS.
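The DHT-based placement Stormy inherits from Dynamo-style systems can be illustrated with a minimal consistent-hash ring. This is a generic sketch of the technique, not Stormy's actual implementation; the class and node names are invented for the example.

```python
import hashlib
from bisect import bisect

class HashRing:
    """Minimal consistent-hash ring, in the spirit of the DHTs (Dynamo,
    Cassandra) that Stormy builds on."""

    def __init__(self, nodes):
        # Place each node on the ring at the hash of its name.
        self.ring = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, stream_id):
        """Route a stream identifier (SID) to the first node clockwise on the ring."""
        keys = [h for h, _ in self.ring]
        idx = bisect(keys, self._hash(stream_id)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("stream-42"))  # same node on every call; moves only if nodes change
```

The appeal of this scheme for elasticity is that adding or removing one node remaps only the keys adjacent to it on the ring, rather than reshuffling every query.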

Comparison
Equivalent to dataflow programming or event stream processing, stream processing takes advantage of a limited form of parallel processing in applications with compute intensity, data parallelism, and data locality (Beard, 2015). The two large-scale data stream processing engines, StreamCloud and Stormy, can be compared under the following three criteria:
     System architecture
In the StreamCloud engine, the basic principle is to define a point in time t such that tuples of a bucket earlier than time t are processed by the old subcluster and tuples later than time t are processed by the new subcluster. Queries are split into subqueries, and each subquery is allocated to a set of instances grouped into a subcluster. The elastic management architecture of StreamCloud (SC) consists of local managers (LMs) that monitor resource utilization and incoming load. Each LM, run by an SC instance, can reconfigure the local load balancer (LB) and reports monitoring information to the elastic manager (EM). The EM can decide to reconfigure to balance the load, or to provision or decommission SC instances in each subcluster. The EM also works with a resource manager (RM) to maintain a pool of instances without queries. When the EM receives a new instance, a subquery is issued and the instance is added to the subcluster, as shown in Figure 1 below:
Figure 1: The elastic management architecture of StreamCloud.
Source: Adapted from Gulisano et al., 2012.
Legend:
Agg : Aggregate
F : Filter
IM : Input merger
LB : Load balancer
M : Map
U : Union
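StreamCloud's split-at-time-t reconfiguration rule described above can be sketched as a simple routing function. The function and parameter names below are hypothetical, chosen only to illustrate the rule, and tuples with timestamp exactly equal to t are sent to the new subcluster here as one possible convention.

```python
def route_by_split_time(tuples, split_time, old_subcluster, new_subcluster):
    """Route each tuple by timestamp: earlier than `split_time` goes to the
    old subcluster, the rest to the new one (illustrative sketch only)."""
    for tup in tuples:
        target = old_subcluster if tup["ts"] < split_time else new_subcluster
        target.append(tup)

old, new = [], []
stream = [{"ts": 1}, {"ts": 5}, {"ts": 9}]
route_by_split_time(stream, split_time=5, old_subcluster=old, new_subcluster=new)
print(len(old), len(new))  # 1 2
```

The point of the rule is that no tuple is processed twice during reconfiguration: ownership of a bucket switches cleanly at one timestamp.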
For the Stormy engine, the architecture is decentralized with no single point of failure, under the assumptions that a query can be executed on one node and that the number of incoming events of a stream is limited. A query graph, a directed acyclic graph (DAG), consists of three components: (1) external data stream inputs, (2) queries, and (3) external output destinations, each of which can be registered separately. Every data stream is identified by a unique stream identifier (SID). Stormy maps the incoming and outgoing SIDs to build an internal representation of the query graph based on four key API functions: (1) registerStream( ), (2) registerQuery( ), (3) registerOutput( ), and (4) pushEvents( ). The Stormy architecture includes input nodes that receive input event streams organized by the Stormy client library. When queries are registered, the event streams are executed and routed through the system as specified by the query graph, and the results are pushed to the output destinations, as shown in Figure 2.
Figure 2: The Stormy architecture with the stream execution.
Source: Adapted from Loesing et al., 2012.
Legend:
E : Registered stream event
N : Node
Q : query
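The way SIDs wire the query graph together can be illustrated with a small mock of the four client-library calls. The real API and its signatures are given in Loesing et al. (2012); everything below, including the Pythonic method names standing in for registerStream( ), registerQuery( ), registerOutput( ), and pushEvents( ), is an illustrative assumption.

```python
# Mock of Stormy's client API to show how stream identifiers (SIDs) connect
# inputs, queries, and outputs into a query graph. All names are assumptions.

class StormyClientStub:
    def __init__(self):
        self._next = 0
        self.queries = {}    # input SID -> (query function, output SID)
        self.sinks = {}      # output SID -> delivered events

    def register_stream(self):          # stands in for registerStream()
        sid, self._next = f"sid-{self._next}", self._next + 1
        return sid

    def register_query(self, in_sid, fn):   # stands in for registerQuery()
        out_sid = self.register_stream()
        self.queries[in_sid] = (fn, out_sid)
        return out_sid

    def register_output(self, sid):     # stands in for registerOutput()
        self.sinks[sid] = []

    def push_events(self, sid, events):  # stands in for pushEvents()
        # route events through the query graph toward registered outputs
        if sid in self.queries:
            fn, out_sid = self.queries[sid]
            self.push_events(out_sid, [fn(e) for e in events])
        if sid in self.sinks:
            self.sinks[sid].extend(events)

# Wire a one-query graph: input stream -> doubling query -> output sink.
c = StormyClientStub()
s_in = c.register_stream()
s_out = c.register_query(s_in, lambda e: e * 2)
c.register_output(s_out)
c.push_events(s_in, [1, 2, 3])
assert c.sinks[s_out] == [2, 4, 6]
```

The key design point survives the simplification: components are registered independently, and only the SID mapping, not the caller, determines where events flow.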
            As the descriptions of the two architectures above show, StreamCloud splits queries into subqueries, each executed by a set of instances grouped in a subcluster, with an elastic manager that balances the load and provisions or decommissions instances. StreamCloud uses elasticity rules to set thresholds, i.e., conditions that trigger provisioning, decommissioning, or load balancing. Stormy, on the other hand, uses a query graph to execute registered stream events on multiple nodes and then delivers results to different destination output nodes after processing is complete. Data flow in the StreamCloud engine appears complex, as interactions between subclusters add latency to the process, whereas data flow in Stormy is smoother once the query graph is defined and registered.
     Performance optimization capability
            StreamCloud offers performance optimizations in two ways. First, the system can merge stateful subqueries if they share the same aggregation method. Second, the system can merge a union with an input merger (IM) and a filter with a load balancer (Gulisano, Jiménez-Peris, Patiño-Martínez, Soriente & Valduriez, 2012). In addition, if the average CPU utilization rises above the upper utilization threshold (UUT) or falls below the lower utilization threshold (LUT), the system triggers provisioning or decommissioning of instances in a new configuration that keeps the average CPU utilization between the LUT and UUT. StreamCloud uses a load-aware provisioning strategy that takes the current subcluster size and load into account to reach the target utilization.
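The threshold logic can be sketched as follows. The LUT, UUT, and target values and the load-aware sizing formula are assumptions chosen for illustration, not StreamCloud's actual parameters.

```python
# Hedged sketch of StreamCloud-style elasticity rules: provision above the
# upper utilization threshold (UUT), decommission below the lower threshold
# (LUT), and size the new configuration so utilization lands near a target
# between the two. Thresholds and the formula are illustrative assumptions.
import math

def elastic_decision(avg_cpu, size, lut=0.3, uut=0.8, target=0.6):
    """Return the new subcluster size, or the old size if no rule fires."""
    if lut <= avg_cpu <= uut:
        return size                          # within bounds: load balance only
    # load-aware sizing: total load (avg_cpu * size) spread at target utilization
    return max(1, math.ceil(avg_cpu * size / target))

assert elastic_decision(0.9, 10) == 15   # overload: provision 5 instances
assert elastic_decision(0.1, 10) == 2    # underload: decommission 8 instances
assert elastic_decision(0.5, 10) == 10   # stable: keep configuration
```

Sizing from the measured load, rather than adding one instance at a time, is what makes the strategy load-aware: a single reconfiguration step reaches the target utilization.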
For performance optimization in overload situations, Stormy applies two key techniques: (1) load balancing and (2) cloud bursting. By continuously measuring its resource utilization, including CPU, disk space, memory, and network consumption, a node can decide to balance load locally; because every node performs local load balancing, the system as a whole reacts efficiently to load variations. Cloud bursting occurs when the overall load of the system becomes too high: the node elected as cloud-bursting leader decides to perform a cloud burst by initiating the procedure for adding a new node. Conversely, if the overall load is too low, the cloud-bursting leader may decide to drop a node from the system.
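The division of labor between local balancing and the leader's burst decision might look like the following. The thresholds and the averaging rule are invented for illustration; Stormy's real policy is described in Loesing et al. (2012).

```python
# Minimal sketch of Stormy's two overload reactions, with made-up thresholds:
# nodes shed load locally by default; only the elected cloud-bursting leader
# adds a node when the whole system is overloaded, or drops one when the
# overall load is too low.

def leader_action(node_loads, high=0.85, low=0.25):
    """Decision taken by the cloud-bursting leader from node utilizations."""
    overall = sum(node_loads) / len(node_loads)
    if overall > high:
        return "add-node"        # cloud burst: provision a new node
    if overall < low and len(node_loads) > 1:
        return "drop-node"       # shrink: decommission one node
    return "balance-locally"     # default: per-node load balancing only

assert leader_action([0.9, 0.95, 0.88]) == "add-node"
assert leader_action([0.1, 0.2]) == "drop-node"
assert leader_action([0.5, 0.6]) == "balance-locally"
```

Centralizing only the add/drop decision in a leader avoids two nodes bursting at once, while the frequent, cheap balancing decisions stay fully decentralized.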
     Scalability
            To achieve scalability, StreamCloud uses several strategies. One is the query-cluster strategy, which allocates a full query to a subcluster of nodes, with communication performed across subclusters. Another is the operator-cluster strategy, where each operator is assigned to a set of nodes, with communication between the nodes of one subcluster and the next. The third is the subquery-cluster strategy, in which a subquery consists of a stateful operator followed by stateless operators, or of the whole query if no stateful operator is present. The subquery-cluster strategy minimizes both the number of communication steps and the fan-out cost. Scalability is achieved by parallelizing stateful subqueries such as joins and aggregates, with each input stream split by the LB into N substreams; it can also be achieved with a Cartesian join, where each tuple is sent to many nodes.
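The splitting of an input stream into N substreams must be semantic-aware for stateful operators: all tuples sharing a group-by key have to reach the same substream so each parallel instance can aggregate its keys independently. A minimal sketch, assuming a hash-based load balancer and an invented tuple shape:

```python
# Sketch of key-based stream splitting for a parallel stateful subquery.
# The hash function, tuple layout, and field names are assumptions; the
# invariant shown (same key -> same substream) is the point.
import zlib

def split_stream(tuples, key_field, n):
    """Partition tuples into n substreams by hashing the group-by key."""
    substreams = [[] for _ in range(n)]
    for tup in tuples:
        idx = zlib.crc32(str(tup[key_field]).encode()) % n
        substreams[idx].append(tup)
    return substreams

tuples = [{"ip": "10.0.0.1", "bytes": 10}, {"ip": "10.0.0.2", "bytes": 5},
          {"ip": "10.0.0.1", "bytes": 7}]
parts = split_stream(tuples, "ip", 4)
owners = [i for i, p in enumerate(parts) if any(t["ip"] == "10.0.0.1" for t in p)]
assert len(owners) == 1   # key 10.0.0.1 lands in exactly one substream
```

A Cartesian join cannot rely on this invariant, which is why, as noted above, each tuple must instead be replicated to many nodes.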
Stormy achieves scalability by using DHTs that map events to queries rather than keys to data. It scales in two dimensions: (1) the number of queries and (2) the number of streams. Stormy is designed to execute additional queries by adding nodes to the system, and it likewise supports additional streams by adding nodes. In principle, therefore, Stormy can support any number of customers.
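One common way to realize such a DHT mapping is consistent hashing, sketched below. The ring layout and hash choice are assumptions for illustration, not Stormy's actual implementation; what matters is that each stream identifier deterministically resolves to one responsible node, and adding nodes spreads the queries and streams further.

```python
# Rough consistent-hashing sketch of a DHT mapping stream IDs to nodes.
# Details (SHA-1, no virtual nodes) are simplifying assumptions.
import bisect
import hashlib

def _h(key):
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)

class ConsistentRing:
    def __init__(self, nodes):
        self._ring = sorted((_h(n), n) for n in nodes)
        self._keys = [k for k, _ in self._ring]

    def node_for(self, sid):
        """Owner of a stream ID: first ring position at or after its hash."""
        i = bisect.bisect(self._keys, _h(sid)) % len(self._ring)
        return self._ring[i][1]

ring = ConsistentRing(["node-a", "node-b", "node-c"])
# every SID resolves deterministically to exactly one owner node
assert ring.node_for("sid-42") in {"node-a", "node-b", "node-c"}
assert ring.node_for("sid-42") == ring.node_for("sid-42")
```

Because only the keys between a new node and its ring predecessor move when a node joins, growth in queries or streams is absorbed with minimal reshuffling.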
Comparing the scalability of these two large-scale data processing engines, StreamCloud applies a more complex set of strategies to achieve scalability, while Stormy simply adds nodes to the system as the number of queries and streams grows.

Performance Bottlenecks
            With continuous data transfer at high speed, performance bottlenecks often occur, especially in large-scale data processing. The bottleneck effect leads to situations in which the system stops acquiring data streams, potentially missing important events, or waits for a clear signal before storing information (Spectrum, 2016).
According to Gulisano (2012), both centralized and distributed SPEs suffer from a bottleneck because they concentrate data streams on single nodes. A centralized SPE executes a query by processing all its data at one node. In distributed SPEs such as StreamCloud, a single query likewise processes its data stream on a single machine. This is the single-node bottleneck: processing capacity is bound by the capacity of one node, which also implies that centralized and distributed SPEs do not scale out.
For Stormy, a performance bottleneck limits the data flow: if the volume of data flowing exceeds the bandwidth of the various nodes, a bottleneck may occur (Techopedia, 2017). To ensure consistent leader election among nodes, Apache ZooKeeper is added to the Stormy system; a performance bottleneck can occur there, but it has been shown that this bottleneck does not belong to Stormy itself.

Root Cause Analysis
            For StreamCloud, the different operators of a single query run on different nodes. A single machine processes each data stream: all input data is routed to the machine running the first operator of the query, the stream generated by the first operator is concentrated at the machine running the second operator, and so on. This is the single-node bottleneck.
            Stormy uses distributed hash tables for efficient data distribution, reliable routing, and flexible load balancing, so performance or network bottlenecks rarely occur. The Stormy data processing system is, however, still in development.

Possible Solutions
            Given the root-cause analysis of StreamCloud's single-node performance bottleneck, the bottleneck can be avoided by processing data streams on multiple nodes. Alternatively, peak-detection firmware that locates maximum or minimum events within data streams, along with their corresponding timing information, can be deployed in the StreamCloud system to prevent bottleneck occurrences.
            Performance bottlenecks do not appear to be a serious issue in the Stormy data stream processing system; they therefore remain a topic for further research.

Conclusion
This Unit 4 Individual Project presented two real-time data stream processing engines, StreamCloud and Stormy. Their main features, such as multi-tenancy, scalability, availability, elasticity, fault tolerance, and Streaming as a Service, were discussed at a high level. The document also compared these large-scale data processing engines and identified their performance bottlenecks. The root cause of each engine's performance bottleneck was provided, and finally, high-level strategies to remove the bottlenecks were proposed.


REFERENCES

Beard, J. (2015). A short intro to stream processing. Retrieved February 26, 2017 from
http://www.jonathanbeard.io/blog/2015/09/19/streaming-and-dataflow.html
Gulisano, V. (2010). StreamCloud: a large scale data stream system. Retrieved February 26, 2017 from http://www.cl.cam.ac.uk/~ey204/teaching/ACS/R202/presentation/S5/Rocky_STREAMCLOUD.pdf
Gulisano, V. (2012). Streamcloud: an elastic parallel-distributed stream processing engine – thesis. Retrieved February 27, 2017 from https://tel.archives-ouvertes.fr/tel-00768281/file/Vincenzo_thesis.pdf
Gulisano, V., Jiménez-Peris, R., Patiño-Martínez, M., Soriente, C. & Valduriez, P. (2012). Streamcloud: an elastic and scalable data streaming system. IEEE Trans. Parallel Distrib. Syst., 23(12):2351–2365.
Loesing, S., Hentschel, M., Kraska, T., & Kossmann, D. (2012). Stormy: an elastic and highly available streaming service in the cloud. Retrieved February 26, 2017 from
https://pdfs.semanticscholar.org/f552/3767b5dd7aa125f22e30054c1b3efc8c6a92.pdf
Sakr, S., & Gaber, M. (Eds.). (2014). Large scale and big data: processing and management. Boca Raton, FL: CRC Press.
Spectrum (2016). Data transfer bottlenecks. Retrieved February 27, 2017 from http://spectrum-instrumentation.com/en/solving-data-transfer-bottleneck-digitizers
Techopedia (2017). Network bottleneck. Retrieved February 27, 2017 from  https://www.techopedia.com/definition/24819/network-bottleneck

Saturday, March 18, 2017

Scalable Stochastic Model for IaaS Cloud Computing

A STUDY REPORT OF SCALABLE STOCHASTIC MODEL 

Introduction
                Today, cloud computing has become popular among users, and the market for large cloud computing services is intensely competitive among large IaaS (Infrastructure as a Service) cloud providers such as Amazon AWS and Google. The performance of a provided cloud service is a decisive factor for the service contract. There are three choices for cloud performance evaluation: (1) performance analysis based on discrete event simulation, (2) performance analysis based on experiment, and (3) performance analysis based on a stochastic model. The first two choices are inefficient in time and cost due to the large number of computing resources used in simulation and experiment (Duke University, 2013). The third choice, the stochastic model, appears to be a low-cost, time-effective option. However, the stochastic model has a scalability problem due to the large size and complexity of a cloud system, in particular the rapid growth of model states and system components (Sakr & Gaber, 2014). This Unit 3 Individual Project document will describe the main concept of the stochastic model, address the proposed stochastic model based on a three-pool cloud architecture along with its limitations and solutions, and then discuss the potential applications of this stochastic model.
Main Concept
               To improve the performance of large IaaS cloud computation and services across hardware, software, workload, and management characteristics, Sakr and Gaber (2014) proposed a simpler, scalable stochastic model in place of a monolithic Markov model, which tends to have many modeling states and system components. Sakr and Gaber's simplified stochastic model targets lower cost, due to fewer computing resources, and shorter response delays, due to the absence of simulation or experiment. It has two primary features, scalability and tractability, and uses a three-pool cloud architecture that includes interacting sub-models such as the RPDE (Resource Provisioning Decision Engine) sub-model, pool sub-models, and VM provisioning sub-models, as shown in Figure 1 below.
Figure 1: Interactions among the sub-models
Source: Adapted from Sakr and Gaber, 2014.

The simplified stochastic model admits a solution with a unique fixed point, computed with stochastic modeling software packages such as SHARPE or SPNP using fixed-point iteration (Ghosh, Longo, Naik & Trivedi, 2012; Hirel, Tuffin & Trivedi, 2000; Trivedi & Sahner, 2009).
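The fixed-point iteration that couples the interacting sub-models can be illustrated with a toy example: each sub-model's output feeds the others' inputs, and the system is re-solved until the exchanged quantities stop changing. The two scalar "sub-models" below are invented stand-ins, not the actual RPDE or VM provisioning equations.

```python
# Toy fixed-point iteration over two interacting "sub-models". The real
# sub-models exchange rates and probabilities; here they are scalar maps.

def solve_interacting(f_pool, f_rpde, x0=0.5, tol=1e-9, max_iter=1000):
    """Iterate x -> f_rpde(f_pool(x)) until it converges to a fixed point."""
    x = x0
    for _ in range(max_iter):
        x_new = f_rpde(f_pool(x))
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    raise RuntimeError("fixed-point iteration did not converge")

# Contraction mappings guarantee a unique fixed point, mirroring the
# uniqueness property claimed for the simplified model's solution.
x_star = solve_interacting(lambda x: 0.5 * x + 0.2, lambda y: 0.5 * y + 0.1)
assert abs(x_star - 0.2 / 0.75) < 1e-6   # analytic fixed point of the toy maps
```

Solving small sub-models repeatedly in this way is what lets the decomposed model avoid building, and solving, the monolithic state space.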
Scalable three-pool cloud architecture

Because processing large-scale data in IaaS cloud computing is complex, a monolithic Markov stochastic model requires large hardware and sophisticated software and is difficult to scale up and scale out. Sakr and Gaber's simplified stochastic model instead uses a three-pool cloud architecture with a scalability feature. In the three-pool cloud architecture, physical machines (PMs), i.e., the computing resources, are grouped into hot, warm, and cold pools depending on power consumption and response time. The hot pool includes PMs that are on and in a ready state, where VMs (virtual machines) wait to be configured for user requests, with maximum power consumption and the least response time. The warm pool consists of PMs in sleep mode waiting for the next run. The cold pool contains PMs that are in the off state, with minimum power consumption and the longest response time (Ghosh, Naik & Trivedi, 2011). The RPDE tries to find an available PM in the hot, warm, and cold pools, in that order, to accept the service request, as shown in Figure 2 below. Service request rejection probability and provisioning response delay are the two elements of the performance analysis metrics.
Figure 2: A three-pool cloud architecture with a service request for a PM.
Source: Adapted from Sakr and Gaber, 2014.
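The RPDE search order described above can be sketched directly. The data structures and the capacity test are simplifying assumptions; the real model tracks VM provisioning per PM through the sub-models.

```python
# Sketch of the RPDE placement search: try the hot pool first, then warm,
# then cold; reject the request only when no pool has capacity.

def rpde_place(request_vms, pools):
    """pools: ordered list of (name, per-PM free VM capacities)."""
    for pool_name, pms in pools:                  # hot -> warm -> cold order
        for i, free in enumerate(pms):
            if free >= request_vms:
                pms[i] -= request_vms             # provision on this PM
                return pool_name
    return "rejected"                             # insufficient capacity

pools = [("hot", [0, 1]), ("warm", [2]), ("cold", [4])]
assert rpde_place(2, pools) == "warm"      # hot PMs are full, warm accepts
assert rpde_place(4, pools) == "cold"      # warm now exhausted, cold accepts
assert rpde_place(5, pools) == "rejected"  # nothing left anywhere
```

The ordering encodes the power/latency trade-off of the three pools: a hot-pool hit costs the least response delay, while falling through to warm or cold trades delay for the power saved by keeping those PMs asleep or off.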

The simplified stochastic three-pool cloud architecture-based model has a key advantage: scalability. It comprises a set of independent sub-models, each of which can represent a PM in a pool and keeps track of the VMs running on that PM. The interactions among the sub-models make the overall model scalable and tractable, and this characteristic lets the simplified stochastic system scale out easily by adding new computers. The reduced numbers of states and non-zero entries allow the solution time to be cut significantly. As a result, users can apply this scalable and tractable stochastic model to detect performance bottlenecks, perform what-if analysis, or plan system capacity (Sakr & Gaber, 2014). For example, the overall mean response delay, the mean queuing delay, or the bottleneck component shifts as the mean service time and the number of PMs in each pool change. Furthermore, grouping the PMs into multiple pools allows the provider to monitor power consumption and response time and adjust the service offered to clients and users.
Limitation of the three-pool cloud architecture
            The main limitation of the three-pool cloud architecture model is that a service request may be rejected. Rejection can occur when the entrance buffer is full, or when the resource provisioning decision engine cannot find an available PM due to insufficient capacity; in most such cases, no PM in the hot, warm, or cold pool is available, and the response delay increases significantly. To address this limitation, Sakr and Gaber (2014) use a Markov reward approach, computing the service request rejection probability as an expected steady-state reward rate (Trivedi, 2001). Sakr and Gaber's analysis shows that the longer the mean service time, the higher the service request rejection probability; increasing PM capacity decreases the rejection probability.
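The qualitative trends just stated (longer mean service time raises rejection, more capacity lowers it) can be illustrated with the classic Erlang-B loss formula. This is used here only as an analogy to the model's steady-state rejection probability, not as Sakr and Gaber's actual equation.

```python
# Erlang-B blocking probability via the standard stable recursion:
# B(0) = 1;  B(k) = a*B(k-1) / (k + a*B(k-1)), with offered load a in Erlangs.

def erlang_b(servers, offered_load):
    """Blocking probability for `servers` PMs at the given offered load."""
    b = 1.0
    for k in range(1, servers + 1):
        b = offered_load * b / (k + offered_load * b)
    return b

# offered_load = arrival_rate * mean_service_time, so both trends follow:
assert erlang_b(10, 1.0 * 5) < erlang_b(10, 1.0 * 8)  # longer service: more rejection
assert erlang_b(20, 8.0) < erlang_b(10, 8.0)          # more capacity: less rejection
```

The analogy also suggests why the fixed-point model stays cheap: blocking-style quantities can be computed by short recursions rather than by enumerating the full joint state space.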
Another limitation of the three-pool cloud architecture model concerns the characteristics of the service requests. Service requests must be homogeneous, and the PMs must be homogeneous as well, i.e., alike, identical, or of the same type; in addition, each service request must ask for one VM instance with a specific RAM, CPU, and disk capacity. Otherwise, the simplified three-pool cloud architecture model will not work properly. Note that the VM provisioning sub-models can be extended to heterogeneous PMs: these PMs can be categorized into multiple classes, each represented by a pool, such that PMs are homogeneous within a pool but heterogeneous across pools.
Potential applications
            Using the SHARPE (Symbolic Hierarchical Automated Reliability and Performance Evaluator) portal, users can compute (1) the service request rejection probability and (2) the mean response delay by solving the interacting sub-models (Duke University, 2017). The service request rejection probability grows with the mean service time. The simplified stochastic model can be used to perform what-if analysis of the entire system; in the mean response delay, bottlenecks may shift as the mean service time and the cloud capacity, i.e., the PMs in each pool, are varied. The accuracy and scalability of the simplified stochastic model can be evaluated with a variant of stochastic Petri nets, the SRN (stochastic reward net) (Hirel, Tuffin & Trivedi, 2000); the Stochastic Petri Net Package (SPNP) can be used to solve the monolithic SRN model. The reduced number of states and non-zero entries in the interacting sub-models decreases the solution time.
The scalability of the simplified stochastic model can be applied to modeling large IaaS pools in big data processing environments. Although the analysis uses three pools, the approach can be expanded to more than three: one CTMC (continuous-time Markov chain) is needed for the VM provisioning sub-model of each pool, so the simplified stochastic model scales linearly with the number of pools in the cloud. As addressed previously, the homogeneous PMs can be extended to heterogeneous PMs categorized into different classes, such that PMs are homogeneous within a pool but heterogeneous across pools. The VM provisioning sub-models can also be extended to parallel provisioning of service requests. Based on the simplified stochastic model, new software packages and analytical tools can be developed for big data applications.
Conclusion
Of the three choices for cloud performance evaluation, the simplified, scalable stochastic model is the most cost- and time-effective. The main concept of the simplified stochastic model proposed by Sakr and Gaber (2014) was discussed thoroughly, and its scalability and practical use were examined in the context of performance analysis of the IaaS cloud computing environment. The document also covered the model's main limitations and presented techniques to remove them, and the potential applications of the simplified stochastic model were provided in depth for future research.

REFERENCES

Duke University (2013). Tutorial on the sharpe interface. Retrieved February 14, 2017 from http://sharpe.pratt.duke.edu/node/4.
Duke University (2017). Sharpe portal. Retrieved February 13, 2017 from https://sharpe.pratt.duke.edu/
Ghosh, R. (2012). Scalable stochastic model for cloud services. Retrieved February 13, 2017 from http://dukespace.lib.duke.edu/dspace/bitstream/handle/10161/6110/Ghosh_duke_0066D_11619.pdf;sequence=1
Ghosh, R., Longo, F., Naik, V. & Trivedi, K. (2012). Modeling and performance analysis of large scale iaas clouds. Elsevier Future Generation Computer Systems. Retrieved February 14, 2017 from http://dx.doi.org/10.1016/j.future.2012.06.005.
Ghosh, R., Naik, V. & Trivedi, K. (2011). Power-performance trade-offs in iaas cloud: a scalable analytic approach. In IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), Workshop on Dependability of Clouds, Data Centers and Virtual Computing Environments (DCDV), pp. 152–157, Hong Kong, China.
Hirel, C., Tuffin, B. & Trivedi, K. (2000). SPNP: stochastic petri nets. Version 6. In International Conference on Computer Performance Evaluation: Modelling Techniques and Tools (TOOLS 2000), B. Haverkort, H. Bohnenkamp (eds.), Lecture Notes in Computer Science 1786, Springer Verlag, pp. 354–357, Schaumburg, IL.
Sakr, S., & Gaber, M. (Eds.). (2014). Large scale and big data: processing and management. Boca Raton, FL: CRC Press.
Trivedi, K. (2001). Probability and statistics with reliability, queuing and computer science applications, second edition. Wiley.

Trivedi, K. & Sahner, R. (2009). Sharpe at the age of twenty two. ACM Sigmetrics Performance Evaluation Review, 36(4):52–57.