Friday, March 24, 2017

Large-Scale Data Stream Processing Engines

Introduction
                In the Big Data era, increasing connectivity between users and machines has generated a massive flow of data into the cloud. The widespread adoption of multicore processors and the emergence of cloud computing led to concurrent computation frameworks such as MapReduce, valued for their simplicity, scalability, and fault tolerance in batch processing tasks. More recently, however, mobile devices, the proliferation of sensors, location services, and real-time network monitoring have created a critical demand for scalable, parallel frameworks that can process huge amounts of streamed data. Centralized stream processing has evolved into decentralized stream engines that distribute queries among the machines of a cluster to achieve scalable streaming data processing in real time (Sakr & Gaber, 2014). Examples of stream processing engines include Aurora, Borealis, IBM System S and IBM SPADE, DEDUCE, Flux, Twitter Storm, StreamCloud, and Stormy.
This Unit 4 Individual Project document chooses the last two of these large-scale data stream processing engines, i.e., StreamCloud and Stormy, for a discussion that includes specifying their primary features, comparing their architecture and scalability, identifying performance bottlenecks, and analyzing root causes and solutions.

Large-Scale Data Streaming Processing Systems
Streaming applications include mobile apps, sensor networks, Web TV, and real-time network monitoring. They require real-time processing of data streams, for example in financial data analysis, sensor network data, or military command and control. A data stream is an infinite, append-only sequence of tuples, analogous to rows in a table. Queries are defined over one or more data streams. Each query consists of a network of operators that include stateless operators (e.g., filter, map, union) and stateful operators (e.g., join, or aggregation over a sliding window). Traditional store-then-process engines cannot meet the massive-volume and low-latency requirements of applications such as network monitoring or fraud detection, which motivated dedicated stream processing engines (SPEs). SPEs can be categorized into distributed SPEs, which distribute the operators of a query across individual nodes, and parallel SPEs, which run the same query on different nodes in parallel.
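The distinction between stateless and stateful operators can be illustrated with a minimal sketch. The function names, the generator style, and the sample tuples below are all illustrative choices, not part of either engine:

```python
from collections import deque

# Hypothetical tuples: (timestamp, value) readings from a sensor stream.
events = [(1, 10), (2, 30), (3, 25), (4, 40), (5, 5)]

def filter_op(stream, predicate):
    """Stateless operator: each tuple is handled independently."""
    for t in stream:
        if predicate(t):
            yield t

def sliding_avg(stream, size):
    """Stateful operator: keeps a sliding window of the last `size` values."""
    window = deque(maxlen=size)
    for ts, v in stream:
        window.append(v)
        yield ts, sum(window) / len(window)

filtered = filter_op(events, lambda t: t[1] >= 10)   # drops (5, 5)
averages = list(sliding_avg(filtered, size=2))
print(averages)  # [(1, 10.0), (2, 20.0), (3, 27.5), (4, 32.5)]
```

The stateless filter needs no memory between tuples, while the sliding-window aggregate must carry state from one tuple to the next; this difference is exactly why stateful operators are the hard case for parallelization, as discussed below.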
StreamCloud is a distributed data stream processing system that provides transparent query parallelization in the cloud. Its main features are scalability, elasticity, transparency, and portability. StreamCloud scales with the volume of the data stream. Its elasticity allows the system to adjust resources on the fly with low intrusiveness. Its transparency lets users benefit from query parallelization without user intervention. For portability, StreamCloud is independent of the underlying SPE (Gulisano, 2010).
            Stormy is a distributed, multi-tenant streaming service that runs on cloud infrastructure. Built on a synthesis of cloud computing techniques such as those of Amazon Dynamo and Apache Cassandra, Stormy achieves multi-tenancy, scalability, availability, elasticity, fault tolerance, and, in particular, streaming as a service (StaaS). For multi-tenancy, Stormy can accommodate many users with different workloads. For scalability, Stormy uses distributed hash tables (DHTs) to distribute queries across all machines for efficient data distribution. For high availability, it replicates queries on several nodes. Stormy's architecture is decentralized, so there is no single point of failure. For elasticity, underused nodes can be shut down to reduce operating cost. It also supports fault tolerance. Stormy is designed to host small and medium-sized streaming applications for StaaS.

Comparison
Equivalent to dataflow programming or event stream processing, stream processing takes advantage of a limited form of parallel processing in applications characterized by compute intensity, data parallelism, and data locality (Beard, 2015). The two large-scale data stream processing engines, StreamCloud and Stormy, can be compared under the three following criteria:
     System architecture
            In the StreamCloud engine, the basic principle is to define a point in time t such that the tuples of a bucket earlier than time t are processed by the old subcluster and the tuples after time t are processed by a new subcluster. Queries are split into subqueries, and each subquery is allocated to a set of instances grouped in a subcluster. The elastic management architecture of StreamCloud (SC) consists of local managers (LMs) that monitor resource utilization and incoming load. Each LM, run by an SC instance, can reconfigure the local load balancer (LB) and reports monitoring information to the elastic manager (EM). The EM can decide to reconfigure the system to balance the load, or to provision or decommission SC instances in each subcluster. The EM also works with a resource manager (RM) that maintains a pool of instances with no queries assigned. When the EM provisions a new instance, the subquery is deployed on it and the instance is added to the subcluster, as shown in Figure 1 below:
Figure 1: The elastic management architecture of StreamCloud.
Source: Adapted from Gulisano et al., 2012.
Legend:
Agg : Aggregate
F : Filter
IM : Input merger
LB : Load balancer    
M : Map
U : Union
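The EM's decision logic described above can be sketched as a minimal, hypothetical rule over the CPU utilizations that the local managers report for one subcluster. The thresholds, the imbalance check, and the action names are all illustrative assumptions, not StreamCloud's actual rules:

```python
def elastic_decision(cpu_reports, lut=0.4, uut=0.8):
    """Hypothetical EM rule: pick an action for one subcluster from the
    per-instance CPU utilizations reported by its local managers (LMs)."""
    avg = sum(cpu_reports) / len(cpu_reports)
    if avg > uut:
        return "provision"      # add an instance from the resource manager's pool
    if avg < lut:
        return "decommission"   # release an instance back to the pool
    if max(cpu_reports) - min(cpu_reports) > 0.3:
        return "rebalance"      # uneven load: reconfigure the load balancers
    return "steady"

print(elastic_decision([0.9, 0.85, 0.88]))  # provision
print(elastic_decision([0.2, 0.25, 0.15]))  # decommission
print(elastic_decision([0.45, 0.78, 0.5]))  # rebalance
```

The point of the sketch is the separation of roles: LMs only measure and report, while the EM makes the global provision/decommission/rebalance decision per subcluster.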
            The Stormy engine's architecture is fully decentralized, with no special nodes, under the assumption that a query can be executed on a single node and that the rate of incoming events per stream is bounded. A query graph, a directed acyclic graph (DAG), consists of three components: (1) external data stream inputs, (2) queries, and (3) external output destinations, each of which can be registered separately. Every data stream is identified by a unique stream identifier (SID). Stormy maps incoming and outgoing SIDs to build an internal representation of the query graph through four key API functions: (1) registerStream( ), (2) registerQuery( ), (3) registerOutput( ), and (4) pushEvents( ). In the Stormy architecture, input nodes receive input event streams delivered by the Stormy client library. Once queries are registered, the event streams are executed and routed through the system as specified by the query graph, and the results are pushed to the output destinations, as shown in Figure 2.
Figure 2: The Stormy architecture with the stream execution.
Source: Adapted from Loesing et al., 2012.
Legend:
E : Registered stream event
N : Node
Q : query
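The four registration calls above can be sketched as a toy, in-memory query graph keyed by SIDs. This single-process sketch only illustrates how the SID mappings chain queries together; the real service distributes this state over many nodes, and the snake_case names and one-input-one-output queries are simplifying assumptions:

```python
# Toy registry: SID -> downstream queries and output sinks.
streams, queries, outputs = {}, {}, {}

def register_stream(sid):
    streams[sid] = True

def register_query(in_sid, out_sid, fn):
    # A query consumes one stream and produces another (an edge of the DAG).
    queries.setdefault(in_sid, []).append((out_sid, fn))
    streams[out_sid] = True

def register_output(sid, sink):
    outputs.setdefault(sid, []).append(sink)

def push_events(sid, events):
    # Route events to any sinks on this SID, then through downstream queries.
    for sink in outputs.get(sid, []):
        sink.extend(events)
    for out_sid, fn in queries.get(sid, []):
        push_events(out_sid, [fn(e) for e in events])

results = []
register_stream("temps")
register_query("temps", "fahrenheit", lambda c: c * 9 / 5 + 32)
register_output("fahrenheit", results)
push_events("temps", [0, 100])
print(results)  # [32.0, 212.0]
```

Because inputs, queries, and outputs are registered separately, the graph can be assembled incrementally, which is what lets Stormy wire tenants' streams together purely through SID lookups.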
            Comparing the two architectures described above, StreamCloud splits queries into subqueries, each of which is executed by a set of instances grouped in a subcluster, while the elastic manager balances the load and provisions or decommissions instances. StreamCloud uses elasticity rules to set thresholds, i.e., conditions that trigger provisioning, decommissioning, or load balancing. Stormy, on the other hand, uses a query graph to execute registered stream events on multiple nodes and then delivers the results to the registered output destinations once processing is complete. Data flow in the StreamCloud engine appears more complex because of the interactions between subclusters, which add latency to the process; data flow in Stormy is smoother once the query graph is defined and registered.
     Performance optimization capability
            StreamCloud is capable of performance optimization in two ways. First, the system can merge stateful subqueries if these subqueries share the same aggregation method. Second, the system can merge the Union operator with the Input Merger (IM) and the Filter operator with the Load Balancer (Gulisano, Jiménez-Peris, Patiño-Martínez, Soriente, & Valduriez, 2012). In addition, if the average CPU utilization rises above the upper utilization threshold (UUT) or falls below the lower utilization threshold (LUT), the system triggers provisioning or decommissioning of instances in a new configuration that keeps the average CPU utilization between the LUT and the UUT. StreamCloud uses a load-aware provisioning strategy that takes the current subcluster size and load into account to reach the target utilization.
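A load-aware sizing rule of this kind can be sketched as simple arithmetic: scale the subcluster so that the current load, spread over the new size, lands at a target utilization between the LUT and the UUT. The midpoint target and the exact formula are assumptions for illustration, not StreamCloud's published strategy:

```python
import math

def provision_size(current_size, avg_util, lut=0.4, uut=0.8):
    """Hypothetical load-aware sizing: choose a subcluster size that brings
    average CPU utilization back between LUT and UUT, aiming at the midpoint."""
    target = (lut + uut) / 2                       # e.g. 0.6
    new_size = math.ceil(current_size * avg_util / target)
    return max(new_size, 1)                        # never shrink below one instance

print(provision_size(4, 0.9))   # overloaded: 4 * 0.9 / 0.6 -> 6 instances
print(provision_size(6, 0.2))   # underloaded: 6 * 0.2 / 0.6 -> 2 instances
```

Sizing in one step like this is what distinguishes load-aware provisioning from adding or removing a single instance at a time and re-measuring.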
For performance optimization under overload, Stormy applies two key techniques: (1) load balancing and (2) cloud bursting. Each node continuously measures its resource utilization, including CPU, disk space, memory, and network consumption, and can decide to balance load locally based on its current utilization. Because every node performs load balancing, the system as a whole reacts efficiently to load variations. Cloud bursting adds a new node to the system when its overall load becomes too high: the node elected as cloud bursting leader decides to perform a cloud burst by initiating the procedure for adding a new node. Conversely, if the overall load is too low, the cloud bursting leader may decide to drop a node from the system.
     Scalability
            To achieve scalability, StreamCloud uses several strategies. One is the query-cluster strategy, in which a full query is allocated to a subcluster of nodes and communication is performed across subclusters. Another is the operator-cluster strategy, in which each operator is assigned to a set of nodes, with communication flowing from the nodes of one subcluster to the next. The third is the subquery-cluster strategy, in which a subquery consists of a stateful operator followed by stateless operators, or of the whole query if it contains no stateful operator. The subquery-cluster strategy minimizes both the number of communication steps and the fan-out cost. Stateful subqueries such as joins and aggregates are parallelized by having the LB split each input stream into N substreams; even Cartesian joins can be parallelized, with each tuple sent to multiple nodes.
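Splitting a stream into N substreams for a parallel stateful operator can be sketched as hashing on the grouping key, so that every tuple belonging to the same group reaches the same instance. The sample trades and function names are illustrative, not StreamCloud's API:

```python
from collections import defaultdict

def partition(stream, key, n):
    """Sketch of an LB splitting one input stream into n substreams by
    hashing the grouping key, so a stateful operator (e.g. an aggregate)
    sees all tuples for a given key on the same instance."""
    substreams = defaultdict(list)
    for t in stream:
        substreams[hash(key(t)) % n].append(t)
    return substreams

trades = [("AAPL", 10), ("MSFT", 5), ("AAPL", 7), ("MSFT", 2)]
subs = partition(trades, key=lambda t: t[0], n=3)

# All tuples for a given symbol land in exactly one substream.
aapl_buckets = {b for b, ts in subs.items() if any(s == "AAPL" for s, _ in ts)}
print(len(aapl_buckets))  # 1
```

Key-based splitting is what makes the per-key state (the aggregate's window) local to one instance, avoiding any cross-instance coordination for ordinary aggregates; a Cartesian join cannot be split this way, which is why each of its tuples must be replicated to many nodes.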
Stormy achieves scalability by using DHTs that map events to queries rather than keys to data. It scales in two dimensions: (1) the number of queries and (2) the number of streams. Stormy can execute more queries by adding more nodes to the system, and it supports more streams in the same way, which means Stormy can in principle support any number of customers.
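The DHT-based scaling can be sketched with a minimal consistent-hashing ring in the spirit of the Dynamo-style systems Stormy builds on: stream identifiers hash onto a ring of nodes, and adding a node moves only the streams that fall between the new node and its predecessor. The class and its details are illustrative assumptions, not Stormy's implementation:

```python
import bisect
import hashlib

class Ring:
    """Minimal consistent-hashing ring mapping stream IDs (SIDs) to nodes."""
    def __init__(self, nodes):
        self.keys, self.nodes = [], {}
        for n in nodes:
            self.add(n)

    def _h(self, s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def add(self, node):
        k = self._h(node)
        bisect.insort(self.keys, k)   # keep ring positions sorted
        self.nodes[k] = node

    def lookup(self, sid):
        # First node clockwise from the SID's position on the ring.
        i = bisect.bisect(self.keys, self._h(sid)) % len(self.keys)
        return self.nodes[self.keys[i]]

ring = Ring(["node-1", "node-2", "node-3"])
before = {sid: ring.lookup(sid) for sid in ("s1", "s2", "s3", "s4")}
ring.add("node-4")                    # scale out by one node
after = {sid: ring.lookup(sid) for sid in before}
moved = sum(before[s] != after[s] for s in before)
print(moved)  # only streams that hash between node-4 and its predecessor move
```

This locality of reassignment is what lets the system grow with the number of queries and streams simply by adding nodes, without re-routing every existing stream.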
Comparing the scalability of these large-scale data processing engines, StreamCloud employs the more complex approach, with several distinct strategies, whereas Stormy simply adds more nodes to the system as the number of queries and streams grows.

Performance Bottlenecks
            With continuous data transfer at high speed, performance bottlenecks often occur, especially in large-scale data processing. The bottleneck effect leads to a situation in which the system stops acquiring data streams, potentially missing important events, or stalls while waiting to store information (Spectrum, 2016).
According to Gulisano (2012), both centralized and distributed SPEs suffer from a bottleneck because they concentrate data streams on single nodes. Executing queries on a centralized SPE processes all data at the same node, and in distributed SPEs a single query likewise processes each data stream on a single machine. This is the single-node bottleneck: the processing capacity is bounded by the capacity of a single node, which also implies that centralized and distributed SPEs do not scale out.
For Stormy, a performance bottleneck limits the data flow: if the volume of data flowing exceeds the bandwidth of the various nodes, a bottleneck may occur (Techopedia, 2017). To ensure consistent leader election, Apache ZooKeeper is added to the system, and ZooKeeper can become a performance bottleneck; however, that bottleneck lies outside Stormy itself.

Root Cause Analysis
            For StreamCloud, different operators of a single query run on different nodes, but a single machine processes each data stream: all the input data is routed to the machine that runs the first operator of the query, the stream generated by the first operator is concentrated on the machine running the second operator, and so on. This is the single-node bottleneck.
            Stormy uses distributed hash tables for efficient data distribution, with reliable routing and flexible load balancing, so performance or network bottlenecks rarely occur. The Stormy data processing system is, however, still in development.

Possible Solutions
            Given the root cause analysis of StreamCloud's single-node performance bottleneck, the bottleneck can be avoided by processing data streams on multiple nodes. Alternatively, peak detection firmware that locates maximum or minimum events within data streams, along with their corresponding timing information, could be deployed in the StreamCloud system to prevent bottleneck occurrences.
            Since the performance bottleneck does not appear to be a serious issue in the Stormy data stream processing system, it is left as a topic for further research.

Conclusion
This Unit 4 Individual Project presented two real-time data stream processing engines, i.e., StreamCloud and Stormy. Their main features, such as multi-tenancy, scalability, availability, elasticity, fault tolerance, and Streaming as a Service, were discussed at a high level. The document also compared these large-scale data processing engines and identified their performance bottlenecks. The root cause of each engine's performance bottleneck was provided, and finally, high-level strategies to remove the bottlenecks were proposed.


REFERENCES

Beard, J. (2015). A short intro to stream processing. Retrieved February 26, 2017 from
http://www.jonathanbeard.io/blog/2015/09/19/streaming-and-dataflow.html
Gulisano, V. (2010). StreamCloud: a large scale data stream system. Retrieved February 26, 2017 from http://www.cl.cam.ac.uk/~ey204/teaching/ACS/R202/presentation/S5/Rocky_STREAMCLOUD.pdf
Gulisano, V. (2012). Streamcloud: an elastic parallel-distributed stream processing engine – thesis. Retrieved February 27, 2017 from https://tel.archives-ouvertes.fr/tel-00768281/file/Vincenzo_thesis.pdf
Gulisano, V., Jiménez-Peris, R., Patiño-Martínez, M., Soriente, C. &  Valduriez, P. (2012).  Streamcloud: an elastic and scalable data streaming system. IEEE Trans. Parallel Distrib. Syst., 23(12):2351–2365.
Loesing, S., Hentschel, M., Kraska, T., & Kossmann, D. (2012). Stormy: an elastic and highly available streaming service in the cloud. Retrieved February 26, 2017 from
https://pdfs.semanticscholar.org/f552/3767b5dd7aa125f22e30054c1b3efc8c6a92.pdf
Sakr, S., & Gaber, M. (Eds.). (2014). Large scale and big data: processing and management. Boca Raton, FL: CRC Press.
Spectrum (2016). Data transfer bottlenecks. Retrieved February 27, 2017 from http://spectrum-instrumentation.com/en/solving-data-transfer-bottleneck-digitizers
Techopedia (2017). Network bottleneck. Retrieved February 27, 2017 from  https://www.techopedia.com/definition/24819/network-bottleneck










