Large-Scale Data Stream Processing Engines
Introduction
In the Big Data era, increasing connectivity between users and machines has generated a massive flow of data into the cloud. The widespread adoption of multicore processors and the emergence of cloud computing have led to concurrent computation frameworks such as MapReduce, adopted for their simplicity, scalability, and fault tolerance in batch processing tasks. However, mobile devices, pervasive sensors, location services, and real-time network monitoring have recently created a critical demand for scalable, parallel frameworks that can process huge amounts of streamed data. Centralized stream processing has evolved into decentralized stream engines that distribute queries among the machines of a cluster to achieve scalable stream processing in real time (Sakr & Gaber, 2014). Examples of stream processing engines include Aurora, Borealis, IBM System S and IBM SPADE, DEDUCE, Flux, Twitter Storm, StreamCloud, and Stormy.
This Unit 4 Individual Project document examines the last two of these large-scale data stream processing engines, i.e., StreamCloud and Stormy. The discussion specifies their primary features, compares their architecture and scalability, identifies performance bottlenecks, and analyzes root causes and solutions.
Large-Scale Data Stream Processing Systems
Typical stream processing applications include mobile apps, sensor networks, Web TV, and real-time network monitoring. They require real-time processing of data streams for tasks such as financial data analysis, sensor network data, or military command and control. A data stream is an infinite, append-only sequence of tuples, analogous to the rows of a table. Queries are defined over one or more data streams. Each query consists of a network of operators that include stateless operators (e.g., filter, map, union) and stateful operators (e.g., join, or aggregates computed over a sliding window). Traditional data storage and processing engines cannot cope with the massive volume and low-latency requirements of these applications, and even stream processing engines (SPEs) reach their limits in demanding applications such as network monitoring or fraud detection. SPEs can be categorized into distributed SPEs, which distribute queries to individual nodes, and parallel SPEs, which run the same query on different nodes in parallel.
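To make the operator model concrete, the following minimal Python sketch composes a stateless filter with a stateful sliding-window average over a stream of tuples. The sample data, field layout, and window size are illustrative assumptions, not part of any particular engine's API.

```python
from collections import deque

# A data stream modeled as an iterator of (symbol, price) tuples:
# an append-only sequence that is consumed tuple by tuple.
stream = iter([("AAPL", 101.2), ("GOOG", 742.0), ("AAPL", 99.8), ("AAPL", 102.5)])

def stream_filter(tuples, predicate):
    """Stateless operator: each tuple is decided in isolation."""
    for t in tuples:
        if predicate(t):
            yield t

def sliding_window_avg(tuples, size):
    """Stateful operator: the result depends on the last `size` tuples."""
    window = deque(maxlen=size)
    for symbol, price in tuples:
        window.append(price)
        yield symbol, sum(window) / len(window)

# A query is a network of operators: here, filter -> windowed aggregate.
query = sliding_window_avg(stream_filter(stream, lambda t: t[0] == "AAPL"), size=3)
for result in query:
    print(result)  # running average of AAPL prices over the last 3 tuples
```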
StreamCloud is a distributed data stream processing system that provides transparent query parallelization in the cloud. Its main features are scalability, elasticity, transparency, and portability. StreamCloud can scale with the volume of the data stream. Its elasticity allows the system to adjust resources on the fly with low intrusiveness. Its transparency means that queries are parallelized without user intervention. For portability, StreamCloud is independent of the underlying SPE (Gulisano, 2010).
Stormy is a distributed multi-tenant streaming service that runs on cloud infrastructure. Based on a synthesis of cloud computing techniques such as those of Amazon Dynamo and Apache Cassandra, Stormy achieves multi-tenancy, scalability, availability, elasticity, fault tolerance, and, in particular, streaming as a service (StaaS). For multi-tenancy, Stormy can accommodate many users with different workloads. For scalability, Stormy uses distributed hash tables (DHTs) to distribute queries across all machines for efficient data distribution. For high availability, it replicates queries on several nodes, and its architecture is decentralized so that there is no single point of failure. For elasticity, underused nodes can be shut down to optimize operating cost. Stormy is designed to host small and medium-sized streaming applications as StaaS.
Comparison
Like dataflow programming and event stream processing, stream processing takes advantage of a limited form of parallel processing in applications that exhibit compute intensity, data parallelism, and data locality (Beard, 2015). The two large-scale data stream processing engines, StreamCloud and Stormy, can be compared under the three following criteria:
System architecture
In the StreamCloud engine, the basic principle of reconfiguration is to define a point in time t such that the tuples of a bucket earlier than t are processed by the old owner subcluster and the tuples after t are processed by the new owner (see the sketch after Figure 1). Queries are split into subqueries, and each subquery is allocated to a set of instances grouped into a subcluster. The elastic management architecture of StreamCloud (SC) consists of local managers (LMs) that monitor resource utilization and incoming load. Each LM, run by an SC instance, can reconfigure the local load balancer (LB) and reports monitoring information to the elastic manager (EM). The EM can decide to reconfigure the system to balance the load, or to provision or decommission the SC instances of each subcluster. The EM also works with a resource manager (RM) to maintain a pool of spare instances with no queries assigned. When the EM receives a new instance, the subquery is deployed on it and the instance is added to the subcluster, as shown in Figure 1 below:
Figure 1: The elastic management architecture of StreamCloud.
Source: Adapted from Gulisano et al., 2012.
Legend:
Agg: Aggregate
F: Filter
IM: Input Merger
LB: Load Balancer
M: Map
U: Union
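As an illustration of the reconfiguration principle described above, the sketch below routes each tuple of a bucket to its old or new owner subcluster depending on whether its timestamp precedes the handover time t. All names and values here are hypothetical; StreamCloud's actual protocol also transfers operator state between owners.

```python
# Hypothetical routing rule for a bucket whose ownership moves at time t:
# tuples stamped before t go to the old owner, later tuples to the new owner.

def route(tuple_ts, bucket, ownership, handover_time):
    """Select the owner subcluster for a bucket at a given tuple timestamp."""
    old_owner, new_owner = ownership[bucket]
    return old_owner if tuple_ts < handover_time else new_owner

# Example: bucket 7 moves from subcluster "A" to subcluster "B" at t = 1000.
ownership = {7: ("A", "B")}
assert route(999, 7, ownership, handover_time=1000) == "A"
assert route(1000, 7, ownership, handover_time=1000) == "B"
```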
For the Stormy engine, the architecture is fully decentralized to avoid any single point of failure, under the assumptions that a query can be executed on a single node and that the number of incoming events of a stream is limited. A query graph, a directed acyclic graph (DAG), consists of three components that can be registered separately: (1) external data stream inputs, (2) queries, and (3) external output destinations. Every data stream is identified by a unique stream identifier (SID). Stormy maps the incoming and outgoing SIDs to build an internal representation of the query graph based on four key API functions: (1) registerStream( ), (2) registerQuery( ), (3) registerOutput( ), and (4) pushEvents( ). The Stormy architecture includes input nodes that receive input event streams from the Stormy client library. Once queries are registered, incoming events are processed and routed through the system as specified by the query graph, and the results are pushed to the output destinations, as shown in Figure 2 (a usage sketch follows the figure legend).
Figure 2: The Stormy architecture with the stream execution.
Source: Adapted from Loesing et al., 2012.
Legend:
E: Registered stream event
N: Node
Q: Query
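The sketch below shows how a client might drive the four API functions named above. Only the four function names come from the Stormy description; the StormyClient class, its data structures, and the example query are assumptions made for illustration.

```python
# Hypothetical client-side sketch mirroring Stormy's four API functions.

class StormyClient:
    def __init__(self):
        self.streams, self.queries, self.outputs = {}, {}, {}

    def register_stream(self, sid):                  # registerStream()
        self.streams[sid] = True                     # record the SID

    def register_query(self, in_sid, out_sid, fn):   # registerQuery()
        self.queries[out_sid] = (in_sid, fn)         # edge of the query graph

    def register_output(self, sid, sink):            # registerOutput()
        self.outputs[sid] = sink                     # external destination

    def push_events(self, sid, events):              # pushEvents()
        # Route events along the query graph from input SID to output SID.
        for out_sid, (in_sid, fn) in self.queries.items():
            if in_sid == sid:
                self.outputs[out_sid]([fn(e) for e in events])

client = StormyClient()
client.register_stream("temps")                      # external input stream
client.register_query("temps", "fahrenheit", lambda c: c * 9 / 5 + 32)
client.register_output("fahrenheit", print)          # push results to stdout
client.push_events("temps", [25, 31, 29])            # prints [77.0, 87.8, 84.2]
```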
Comparing the two architectures described above, StreamCloud splits queries into subqueries, each executed by a set of instances grouped into a subcluster, with an elastic manager that balances the load and provisions or decommissions instances. StreamCloud uses elasticity rules with thresholds, i.e., conditions that trigger provisioning, decommissioning, or load balancing. Stormy, on the other hand, uses the query graph to execute registered queries over stream events on multiple nodes and then delivers the results to the registered output destinations once processing is complete. The data flow in the StreamCloud engine appears more complex because of the interactions between subclusters, which add latency to the process, whereas the data flow in Stormy is smoother once the query graph is defined and registered.
Performance optimization capability
StreamCloud is capable of performance optimization in two ways. First, the system can merge stateful subqueries if these subqueries share the same aggregation method. Second, the system can merge the union operator with the input merger (IM) and the filter operator with the load balancer (Gulisano, Jiménez-Peris, Patiño-Martínez, Soriente & Valduriez, 2012). In addition, if the average CPU utilization is above the upper utilization threshold (UUT) or below the lower utilization threshold (LUT), the system triggers the provisioning or decommissioning of instances into a new configuration so as to keep the average CPU utilization between the LUT and the UUT. StreamCloud uses a load-aware provisioning strategy that takes the current subcluster size and load into account to reach the target utilization.
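A minimal sketch of such a threshold rule follows; the LUT and UUT values and the action names are assumed for illustration, and StreamCloud's real protocol involves the EM, LMs, and the resource manager described earlier.

```python
# Threshold-based elasticity rule (illustrative values, not StreamCloud's).
LUT, UUT = 0.30, 0.85  # lower and upper CPU utilization thresholds

def elasticity_action(cpu_samples):
    """Decide an action from a subcluster's average CPU utilization."""
    avg = sum(cpu_samples) / len(cpu_samples)
    if avg > UUT:
        return "provision"     # add instances from the spare pool
    if avg < LUT:
        return "decommission"  # return instances to the pool
    return "balance"           # within [LUT, UUT]; rebalance load if skewed

print(elasticity_action([0.90, 0.95, 0.88]))  # -> provision
print(elasticity_action([0.10, 0.20, 0.15]))  # -> decommission
```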
To handle overload situations, Stormy applies two key techniques: (1) load balancing and (2) cloud bursting. Each node continuously measures its resource utilization, including CPU, disk space, memory, and network consumption, and based on its current utilization can decide to balance load locally. Because every node performs load balancing on its own, the entire system reacts efficiently to load variations. Cloud bursting adds a new node to the system when the overall load becomes too high. The node elected as cloud-bursting leader decides when to perform a cloud burst and initiates the cloud-bursting procedure for adding a new node. Conversely, if the overall load is too low, the cloud-bursting leader may decide to drop a node from the system.
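The sketch below captures the leader's decision logic under stated assumptions: the thresholds, the averaged load metric, and the minimum cluster size are illustrative, not values from the Stormy paper.

```python
# Hypothetical cloud-bursting decision made by the elected leader node.
HIGH, LOW = 0.80, 0.20  # assumed overall-load thresholds

def cloud_burst_decision(node_loads, min_nodes=2):
    """Add a node when overall load is too high, drop one when too low."""
    overall = sum(node_loads) / len(node_loads)
    if overall > HIGH:
        return "add_node"   # initiate the cloud-bursting procedure
    if overall < LOW and len(node_loads) > min_nodes:
        return "drop_node"  # shrink the system to save cost
    return "no_op"

print(cloud_burst_decision([0.90, 0.85, 0.95]))  # -> add_node
```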
Scalability
To achieve scalability, StreamCloud uses several strategies. One of them is the query-cluster strategy, in which a full query is allocated to a subcluster of nodes and communication is performed across nodes. Another is the operator-cluster strategy, in which each operator is assigned to a set of nodes, with communication flowing from the nodes of one subcluster to the next. The third is the subquery-cluster strategy, in which each subquery consists of a stateful operator followed by stateless operators, or of the whole query if it contains no stateful operator. The subquery-cluster strategy minimizes the number of communication steps and minimizes the fan-out cost. Stateful subqueries such as joins and aggregates are parallelized by letting the LB split each input stream into N substreams; for a Cartesian join, each tuple is sent to many nodes.
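The sketch below illustrates, under assumptions, how a load balancer could split an input stream into N substreams for a parallel stateful operator: tuples that must be aggregated or joined together share a key, so hashing on the key keeps them on the same instance. The field names and the value of N are illustrative.

```python
import zlib

N = 4  # number of substreams / instances in the subcluster (assumed)

def substream(key):
    """Stable hash so tuples with the same key reach the same instance."""
    return zlib.crc32(key.encode()) % N

tuples = [("alice", 10), ("bob", 5), ("alice", 7)]
for key, value in tuples:
    print(f"({key}, {value}) -> substream {substream(key)}")
# Both "alice" tuples land on one substream, so a per-key aggregate stays correct.
```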
Stormy achieves scalability by using DHTs that map events to queries rather than keys to data. It scales in two dimensions: (1) the number of queries and (2) the number of streams. Stormy is designed to execute ever more queries by adding more nodes to the system, and it likewise supports ever more streams by adding nodes. In principle, Stormy can therefore support any number of customers.
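In the spirit of the Dynamo- and Cassandra-style techniques Stormy borrows, the sketch below maps stream identifiers (SIDs) to nodes with a simple consistent-hashing ring; the node names and hash choice are assumptions, and real DHTs add replication and virtual nodes.

```python
import bisect
import hashlib

def h(value):
    """Position a string on the hash ring."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        self.ring = sorted((h(n), n) for n in nodes)

    def node_for(self, sid):
        """Walk clockwise to the first node at or after the SID's position."""
        keys = [k for k, _ in self.ring]
        i = bisect.bisect(keys, h(sid)) % len(self.ring)
        return self.ring[i][1]

ring = Ring(["node-1", "node-2", "node-3"])
print(ring.node_for("stream-42"))  # adding a node remaps only nearby SIDs
```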
Comparing the scalability of these two large-scale data processing engines, StreamCloud applies the more complex techniques, with several strategies for achieving scalability, whereas Stormy simply relies on the number of queries and streams and adds more nodes to the system.
Performance Bottlenecks
With continuous data transfer at high speed, performance bottlenecks often occur, especially in large-scale data processing. The bottleneck effect leads to situations in which the system stops acquiring data streams, potentially missing important events, or waits for a signal that it is clear to store information (Spectrum, 2016).
According to Gulisano (2012), both centralized and distributed SPEs suffer from a bottleneck because they concentrate data streams on single nodes. Executing a query on a centralized SPE processes all data at the same node. In distributed SPEs such as StreamCloud, a single query likewise processes a data stream on a single machine. This is the single-node bottleneck: the processing capacity is bound to the capacity of a single node, which also implies that centralized and distributed SPEs do not scale out.
For Stormy, a performance bottleneck limits the data flow: if the volume of flowing data is higher than the bandwidth of the participating nodes, a bottleneck may occur (Techopedia, 2017). To ensure consistent leader election in the Stormy process, Apache ZooKeeper is added to the system; a bottleneck can occur there, but it was shown that this bottleneck does not belong to Stormy itself.
Root Cause Analysis
For StreamCloud, the different operators of a single query run on different nodes, yet a single machine processes each data stream: all of the input data stream is routed to the machine that runs the first operator of the query, the stream generated by the first operator is concentrated at the machine running the second operator, and so on. This is the single-node bottleneck.
Stormy, in contrast, uses distributed hash tables for efficient data distribution, reliable routing, and flexible load balancing, so performance or network bottlenecks rarely occur. Moreover, the Stormy data processing system is still in development.
Possible Solutions
Given the root cause analysis of StreamCloud's single-node performance bottleneck, the bottleneck can be avoided by processing data streams on multiple nodes. Alternatively, peak detection firmware that locates maximum or minimum events within data streams, together with their corresponding timing information, could be deployed in the StreamCloud system to prevent bottleneck occurrences.
It appears that performance bottlenecks are not a serious issue in the Stormy data stream processing system; they therefore remain a topic for further research.
Conclusion
This Unit 4 Individual Project presented two real-time data stream processing engines, i.e., StreamCloud and Stormy. Their main features, such as multi-tenancy, scalability, availability, elasticity, fault tolerance, and Streaming as a Service, were discussed at a high level. The document also compared these large-scale data processing engines and identified their performance bottlenecks. The root cause of each engine's performance bottleneck was provided, and finally, high-level strategies to remove the bottlenecks were proposed.
REFERENCES
Beard, J. (2015). A short intro to stream processing. Retrieved February 26, 2017, from http://www.jonathanbeard.io/blog/2015/09/19/streaming-and-dataflow.html
Gulisano, V. (2010). StreamCloud: A large scale data stream system. Retrieved February 26, 2017, from http://www.cl.cam.ac.uk/~ey204/teaching/ACS/R202/presentation/S5/Rocky_STREAMCLOUD.pdf
Gulisano, V. (2012). StreamCloud: An elastic parallel-distributed stream processing engine (doctoral thesis). Retrieved February 27, 2017, from https://tel.archives-ouvertes.fr/tel-00768281/file/Vincenzo_thesis.pdf
Gulisano, V., Jiménez-Peris, R., Patiño-Martínez, M., Soriente, C., & Valduriez, P. (2012). StreamCloud: An elastic and scalable data streaming system. IEEE Transactions on Parallel and Distributed Systems, 23(12), 2351–2365.
Loesing, S., Hentschel, M., Kraska, T., & Kossmann, D. (2012). Stormy: An elastic and highly available streaming service in the cloud. Retrieved February 26, 2017, from https://pdfs.semanticscholar.org/f552/3767b5dd7aa125f22e30054c1b3efc8c6a92.pdf
Sakr, S., & Gaber, M. (Eds.). (2014). Large scale and big data: Processing and management. Boca Raton, FL: CRC Press.
Spectrum (2016). Data transfer bottlenecks. Retrieved February 27, 2017, from http://spectrum-instrumentation.com/en/solving-data-transfer-bottleneck-digitizers
Techopedia (2017). Network bottleneck. Retrieved February 27, 2017, from https://www.techopedia.com/definition/24819/network-bottleneck