Thursday, February 23, 2017

A Survey Report on RDF Data Query Processing


Introduction
The third generation of the World Wide Web (Web 3.0), the Web of semantic affiliation or knowledge content, uses the Resource Description Framework (RDF) as the conceptual foundation and data model for information in Web resources. The RDF model, which represents data in a machine-readable form, is widely used in Web 3.0 knowledge-management applications. An RDF expression is a set of triples, which can equivalently be viewed as a directed labeled RDF graph. Each triple consists of a (1) subject, (2) predicate, and (3) object. Querying RDF triples amounts to performing large join operations, and traditional relational systems cannot process queries with many star joins efficiently. A Uniform Resource Identifier (URI) is typically used in RDF to identify a unique Web resource (Brickley & Guha, 2014). SPARQL (SPARQL Protocol and RDF Query Language) is a semantic query language that can retrieve and manipulate RDF data (W3C, 2004). However, querying large-scale RDF data sets raises difficult problems. For instance, computing SPARQL queries over large-scale Web data sets requires several joins between subsets of the data, which complicates programming. Also, traditional single-machine approaches cannot scale up or scale out as the volume of available RDF data grows.
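To make the triple structure concrete, the following sketch (illustrative Python only, not from the surveyed papers; the graph contents and function name are hypothetical) models an RDF graph as a set of (subject, predicate, object) triples and matches one triple pattern containing a variable:

```python
# An RDF graph modeled as a set of (subject, predicate, object) triples.
# URIs are abbreviated as plain strings for readability.
graph = {
    ("ex:Alice", "ex:knows", "ex:Bob"),
    ("ex:Alice", "ex:age", "42"),
    ("ex:Bob", "ex:knows", "ex:Carol"),
}

def match_pattern(graph, pattern):
    """Return variable bindings for one triple pattern.
    Terms starting with '?' are variables; other terms must match exactly."""
    results = []
    for triple in graph:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                binding[term] = value
            elif term != value:
                break
        else:
            results.append(binding)
    return results

# Who does ex:Alice know?
bindings = match_pattern(graph, ("ex:Alice", "ex:knows", "?x"))
```

A SPARQL engine generalizes this idea: each triple pattern yields a set of bindings, and patterns sharing variables are joined.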
In 2004, Google’s MapReduce framework, a parallel and distributed programming model, opened up new ways to improve the efficiency of RDF query processing. Amazon’s EC2 (Elastic Compute Cloud), YARS2, and 4store are a few example solutions. In this survey report, two promising solutions for improving RDF query performance, (1) PigSPARQL and (2) MAPSIN, are compared and evaluated on four criteria: main focus, key technical changes, rationale, and an analysis of the pros and cons of each solution.

Main focus of the solutions
     1. PigSPARQL

SPARQL, recommended by the W3C, is the standard query language for large RDF datasets. A SPARQL query is evaluated over RDF triples whose subject, predicate, and object positions may contain variables, and on MapReduce it is executed as a sequence of map, shuffle, sort, and reduce tasks. The challenge is to join the data sets efficiently, either map-side or reduce-side. Reduce-side join computation is inefficient for selective joins and requires a great deal of network bandwidth. Map-side merge joins are difficult to cascade, and some advantages, such as avoiding the shuffle phase, are lost (Ghemawat, Gobioff & Leung, 2003). Pig Latin, developed by Yahoo! Research, is an Apache Hadoop-based language for analyzing very large data sets. Apache Pig, a high-priority Hadoop project, automatically translates a Pig Latin program into MapReduce jobs. Translating SPARQL to Pig Latin therefore lets SPARQL query processing on a MapReduce cluster benefit from the performance enhancements and newer Hadoop version support that come with the ongoing development of Apache Pig, with minimal changes to the translation code (Blanas, Patel, Ercegovac, Rao, Shekita & Tian, 2010). PigSPARQL is a query translation technique that translates complex SPARQL queries into Pig Latin programs executed on a MapReduce cluster. Fig 1 illustrates the high-level modular translation process.
Fig 1: A modular translation process
Source: Adapted from Schatzle et al., 2011.
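A SPARQL basic graph pattern of two triple patterns is essentially a join on their shared variable, which is what the generated Pig Latin JOIN computes. The toy Python sketch below (illustrative only; PigSPARQL emits Pig Latin, not Python, and all names here are hypothetical) joins the binding sets of two patterns on the shared variable ?x:

```python
# Bindings produced by two triple patterns, e.g.
#   ?x ex:knows ?y   and   ?x ex:age ?a
left = [{"?x": "ex:Alice", "?y": "ex:Bob"},
        {"?x": "ex:Bob", "?y": "ex:Carol"}]
right = [{"?x": "ex:Alice", "?a": "42"}]

def join(left, right, var):
    """Join two binding lists on a shared variable, merging compatible rows
    (conceptually what a Pig Latin JOIN on that variable computes)."""
    index = {}
    for row in right:
        index.setdefault(row[var], []).append(row)
    out = []
    for row in left:
        for other in index.get(row[var], []):
            out.append({**row, **other})
    return out

result = join(left, right, "?x")
```

On a MapReduce cluster, the grouping by the join key that this sketch does with a dictionary is realized by the shuffle phase, which is exactly the cost reduce-side joins pay.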
     2. MAPSIN
Building on HBase’s indexing capabilities, MAPSIN (Map-Side Index Nested Loop Join) improves the performance of selective queries: it retains the flexibility of reduce-side joins while gaining the effectiveness of map-side joins, without any change to the framework. Its main focus is to use the indexing capabilities of the NoSQL store HBase for scalable MAPSIN joins on the Hadoop MapReduce framework, including multiway joins and one-pattern queries. HBase, a column-oriented (column-family) NoSQL database that integrates well with Hadoop, can store arbitrary RDF graphs. An RDF storage schema defines how RDF data is modeled in the store. HBase acts as an extra storage layer on top of HDFS that provides near-real-time random access, which HDFS itself does not offer. MAPSIN computes the join between two triple patterns by merging compatible mappings in a single map phase, transferring only the data that is actually needed. Fig 2 illustrates a typical RDF graph and a SPARQL query.
Fig 2: RDF graph and SPARQL query
Source: Adapted from Schatzle et al., 2012.
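The core MAPSIN idea can be sketched as follows (a simplified Python simulation; the real implementation performs HBase index lookups, which a plain dictionary stands in for here, and all names are hypothetical):

```python
# Stand-in for an HBase S_PO index: subject -> {(predicate, object), ...}.
spo_index = {
    "ex:Alice": {("ex:knows", "ex:Bob"), ("ex:age", "42")},
}

def mapsin_join(mappings, pattern, index):
    """Map-side index nested loop join: for each input mapping, probe the
    index for triples compatible with the next pattern (no shuffle phase)."""
    subj_var, pred, obj_var = pattern      # e.g. ("?x", "ex:age", "?a")
    out = []
    lookups = 0
    for m in mappings:
        lookups += 1                       # one index lookup per mapping
        for p, o in index.get(m[subj_var], set()):
            if p == pred:
                out.append({**m, obj_var: o})
    return out, lookups

mappings = [{"?x": "ex:Alice"}, {"?x": "ex:Bob"}]
result, lookups = mapsin_join(mappings, ("?x", "ex:age", "?a"), spo_index)
```

Because only the mappings themselves flow through the map phase and all candidate join partners are fetched on demand from the index, no join partitions need to be shuffled across the network.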

Main technical changes
The key technical changes in PigSPARQL and MAPSIN are discussed as follows:
     1. PigSPARQL
With RDF-format data, the SPARQL query language, the MapReduce model, and a Pig Latin implementation, PigSPARQL translates complex SPARQL queries into MapReduce jobs through a series of algebraic representations: a syntax tree, an algebra tree, and a Pig Latin program. Note that a SPARQL query is handled on the algebra level, and the SPARQL algebra expression is interpreted as a tree that is evaluated bottom-up. Query processing time scales linearly with the size of the RDF data, a feature inherited from the MapReduce framework (Schatzle, Przyjaciel-Zablocki & Lausen, 2011).
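Bottom-up evaluation of an algebra tree can be illustrated with a minimal sketch (hypothetical node classes, not PigSPARQL's actual implementation): each node evaluates its children before combining their results, mirroring how each algebra operator is translated into Pig Latin statements that consume the output of earlier statements.

```python
class BGP:
    """Leaf node: a basic graph pattern that yields precomputed bindings."""
    def __init__(self, bindings):
        self.bindings = bindings

    def evaluate(self):
        return self.bindings

class Union:
    """Inner node: evaluates its children first (bottom-up), then combines."""
    def __init__(self, left, right):
        self.left, self.right = left, right

    def evaluate(self):
        return self.left.evaluate() + self.right.evaluate()

# UNION of two one-pattern groups, evaluated bottom-up from the leaves.
tree = Union(BGP([{"?x": "ex:Alice"}]), BGP([{"?x": "ex:Bob"}]))
result = tree.evaluate()
```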
     2. MAPSIN
Using HBase's indexing capabilities, the MAPSIN join computes the join between two triple patterns in a single map phase with a minimum of transferred data. Triple patterns are cascaded in chains, and the mappings are computed by iterating the MAPSIN join. The performance of MAPSIN joins is tightly correlated with the number of HBase index lookups, so minimizing the number of index lookups is crucial for optimization (Schatzle, Przyjaciel-Zablocki, Dorner, Hornung, & Lausen, 2012).
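The cascading can be sketched as folding one map-phase join per triple pattern over the chain (again a simplified Python stand-in for the HBase-backed implementation, with hypothetical data). Counting the lookups also shows why their number grows with the intermediate mapping sets, which is what the optimization targets:

```python
# Stand-in index: subject -> list of (predicate, object) pairs.
index = {
    "ex:a": [("ex:p", "ex:b")],
    "ex:b": [("ex:q", "ex:c")],
}

def cascade(seed, patterns, index):
    """Chain of MAPSIN-style joins: each pattern extends every current
    mapping via an index lookup; lookups are counted to expose the cost."""
    mappings, lookups = seed, 0
    for subj_var, pred, obj_var in patterns:
        nxt = []
        for m in mappings:
            lookups += 1
            for p, o in index.get(m[subj_var], []):
                if p == pred:
                    nxt.append({**m, obj_var: o})
        mappings = nxt
    return mappings, lookups

chain = [("?x", "ex:p", "?y"), ("?y", "ex:q", "?z")]
mappings, lookups = cascade([{"?x": "ex:a"}], chain, index)
```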
Rationale for the technical changes
            The rationale for the technical changes in PigSPARQL and MAPSIN is explained below:
     1. PigSPARQL 
Querying RDF datasets at Web scale is difficult because SPARQL computation requires several joins between data subsets, and single-machine techniques cannot scale to large RDF data. The MapReduce framework, with its good scalability properties, therefore becomes attractive for SPARQL processing on the Apache Hadoop platform.
When extracting information from a large RDF dataset and then transforming and loading the extracted data into a different format, cluster-based parallelism can outperform parallel databases. PigSPARQL offers not only RDF query translation but also a scalable implementation of the entire ETL process on a MapReduce cluster. PigSPARQL provides good performance and excellent scalability for complex analytical queries, but it performs poorly on selective queries because it lacks adequate built-in index structures and shuffles redundant data during join computation in the reduce phase (Schatzle et al., 2011).
     2. MAPSIN
According to Schatzle et al. (2012), the MAPSIN join takes advantage of the indexing capabilities of the distributed NoSQL store HBase to increase the performance of selective queries. HBase, as a storage layer on top of HDFS, lets MAPSIN joins process data in the map phase and thereby avoid costly data shuffling. The MAPSIN join algorithm can cascade joins across MapReduce iterations: consecutive joins need no additional shuffle and reduce phase to pre-process the data. Notably, MAPSIN joins require no changes to the underlying MapReduce framework. The authors' evaluation indicated a significant improvement for selective queries over the common reduce-side join.

Analysis of pros and cons
The performance of PigSPARQL (with SPARQL and Pig Latin) and MAPSIN (with HBase) is analyzed here. Their pros and cons are discussed as follows:
     1. PigSPARQL
          a. Pros
PigSPARQL is a translation framework that maps RDF queries from SPARQL to Pig Latin on a MapReduce cluster without code changes or management overhead. It takes advantage of parallel processing of large-scale datasets and provides the transformation step of a scalable ETL implementation. The PigSPARQL approach is easier to implement and maintain than a direct mapping onto the MapReduce framework (Afrati & Ullman, 2011). Its translation applies several optimization strategies effectively, and its performance and scaling properties are competitive for complex analytical queries.
          b. Cons
PigSPARQL performs poorly on selective queries: it has no built-in index structures, and in the reduce phase it shuffles data unnecessarily during join computation.
     2. MAPSIN
          a. Pros
The MAPSIN join takes advantage of the indexing capabilities of the distributed NoSQL database HBase to improve selective queries. It eliminates costly data shuffling and outperforms the common reduce-side join on selective queries. HBase can store RDF data in a space-efficient schema with favorable access characteristics. With HBase, join partitions are not shuffled across the network; instead, the relevant join partners are accessed in each iteration, and the reduce phase is not used at all. No change to the MapReduce framework or the Hadoop platform is required, and map-side joins are much more efficient than reduce-side joins. The combination of HBase and MapReduce allows a sequence of MAPSIN joins to be cascaded without sorting and repartitioning the intermediate output for the next iteration (Dean & Ghemawat, 2008). Multiway-join optimization further reduces the number of MapReduce iterations and HBase requests. Overall, the MAPSIN join approach outperforms the reduce-side join for selective queries and significantly improves total query execution times.
          b. Cons
MAPSIN joins require strict preconditions that make them difficult to apply to arbitrary sequences of joins.

Conclusion
In summary, this survey report reviewed two widely used approaches, PigSPARQL and MAPSIN, which build on the MapReduce framework and the Hadoop platform to process RDF queries over large-scale data sets. The report identified the primary focus of each approach and described it at a high level along with the associated technologies: SPARQL, Pig Latin, HBase, the MapReduce framework, and Hadoop HDFS. The rationale for the technical changes in each solution was provided, together with an analysis of the pros and cons of PigSPARQL and MAPSIN. An open question is how to incorporate MAPSIN joins into PigSPARQL as a complementary hybrid solution that chooses the join method dynamically based on pattern selectivity and statistics gathered at data loading time.

REFERENCES

Afrati, F. & Ullman, J. (2011). Optimizing multiway joins in a map-reduce environment. IEEE Trans. Knowl. Data Eng. 23(9), 1282–1298.
Blanas, S., Patel, J.M., Ercegovac, V., Rao, J., Shekita, E.J. & Tian, Y. (2010). A comparison of join algorithms for log processing in MapReduce. In: SIGMOD.
Brickley, D., & Guha, R. V. (eds). (2014, February 25). RDF Schema 1.1. W3C. Retrieved from http://www.w3.org/TR/2014/REC-rdf-schema-20140225/
Dean, J. & Ghemawat, S. (2008). Mapreduce: simplified data processing on large clusters. Communications of the ACM 51(1), 107–113.
Ghemawat, S., Gobioff, H. & Leung, S.T. (2003). The Google file system. In: Proc. SOSP, pp. 29–43.
Sakr, S., & Gaber, M. (Eds.). (2014). Large scale and big data: processing and management. Boca Raton, FL: CRC Press.
W3C (The World Wide Web Consortium). (2004). Resource description framework: concepts and abstract syntax, W3C recommendation.
Schatzle, A., Przyjaciel-Zablocki, M., Dorner, C., Hornung, T. & Lausen, G. (2012). Cascading map-side joins over HBase for scalable join processing. Retrieved January 29, 2017 from http://ceur-ws.org/Vol-943/SSWS_HPCSW2012_paper5.pdf
Schatzle, A., Przyjaciel-Zablocki, M. & Lausen, G. (2011). PigSPARQL: Mapping SPARQL to Pig Latin. Retrieved January 31, 2017 from http://www.csd.uoc.gr/~hy561/papers/storageaccess/largescale/PigSPARQL-%20Mapping%20SPARQL%20to%20Pig%20Latin.pdf

Thursday, February 9, 2017

Performance Bottleneck on the Web
By TS Le


Introduction
In the early 1960s, the Internet was developed to link various computer networks into a connected system that would not be disrupted by local disasters (Brookshear, 2012). In 1989, Berners-Lee introduced the World Wide Web (the Web), which allowed users to access resources in the form of interlinked hypertext from any place at any time over the Internet (Jacobs, & Walsh, 2004). On the hypertext Web, resources conventionally describe Web pages, images, advertisements, and product catalogs. The Web has evolved through five generations: (1) Web 1.0 for information connection, (2) Web 2.0 for social connection or verbalization, (3) Web 3.0 for knowledge content or semantic affiliation, (4) Web 4.0 for symbiotic integration, and (5) Web 5.0 for decentralized smart communication (Patel, 2013).
With rapid Web growth, the data on the Web also increases exponentially and becomes more diverse, complex, ubiquitous, and sophisticated (Vermesan, & Friess, 2013). Accessing large-scale Web data sets for various applications poses challenges across all generations of this unique socio-technical Web. This document focuses on potential performance bottlenecks: it reports the main sources of Web data, identifies possible bottleneck locations, analyzes root causes, and proposes solutions.
Data Sources
The Web is a unique socio-technical system that involves the interaction between humans (e.g., users' behaviors) and technological networks (e.g., complex infrastructures). It is about joint optimization, i.e., the interrelatedness of the social and technical aspects of an organization or of society as a whole (Trist, & Bamforth, 1951). The Web, invented by Berners-Lee (1989), has evolved rapidly over five generations:
     - Web 1.0 is the first implementation of the read-only Web, for information connection, by 1996. The data consists mostly of documents, such as HTML pages, owned by companies and organizations. The sources of these documents are Web-oriented organizations, and people consume them.
     - Web 2.0 is the second generation, the read/write Web, which connected billions of users by 2006. It is also characterized as the Web of verbalization, which assembled and managed large global crowds around common interests in social interaction. The data consists of Web pages, social messages, emails, texts, etc., whose sources are communities or large groups of users. People publish content and others consume it. Companies such as YouTube, Blogger, and Wikipedia build and provide a framework for users, who are the data sources providing the content.
     - Web 3.0 is the third generation, the Semantic Web for data integration, established by 2016. It is a read/write Web of shared data used by trillions of users around the world. It connects knowledge into meaningful information that can be located, evaluated, and delivered by software agents for portable or smart applications. The data sources are people-oriented applications for human interaction, e.g., human-to-human (H2H), human-to-machine (H2M), and machine-to-machine (M2M) communication. Companies such as Facebook, Google Maps, Amazon, LinkedIn, and Twitter create platforms that let people provide content and build their own applications.
     - Web 4.0 is the current (2017) symbiotic and ubiquitous Web, the Web of integration. This socio-technical Web with intelligent applications aims to be as powerful as the human brain, building on developments in telecommunications and nanotechnology. Machines can read the contents of the Web, and Web 4.0 is a read/write concurrency Web. Data sources are people, companies, IoT devices and sensors, intelligent machines such as robots operating without human control, intelligent applications, artificial intelligence products, and the migration of online functionality into the physical world.
     - Web 5.0, still under development, is called the Web of the decentralized smart communicator. It is at the very early, nascent stage of the Symbionet Web, in which people interconnect through smart communicators such as smartphones, tablets, and personal robots. Data sources are humans, intelligent devices, robots, personal servers, and smart communicators that can take action autonomously, without human involvement or interference.
Where do performance bottlenecks occur?
Accessing large-scale Web data flows poses a big performance challenge in the form of bottlenecks. A bottleneck is a condition in which a small number of resources limits an entire system's capacity or performance. Some typical performance bottlenecks in Web applications are listed below (Thomas, 2012):
- Extended user response time
- Extended server response time
- High CPU usage
- Invalid data returned
- HTTP errors (4xx, 5xx)
- Many open connections
- Lengthy request queues
- Memory leaks
- Extensive database table scans
- Database deadlocks
- Unavailable pages
For Web 1.0 and, partially, Web 2.0, a bottleneck may occur in the network, in the storage fabric, or within servers on the Web. Consequently, data flow is reduced significantly, which degrades application performance, particularly for database and transaction applications; it can freeze the entire system or even cause applications to crash. For Web 2.0 and 3.0, the performance bottleneck becomes a serious problem on the network when Web users access large-scale Web data, such as Big Data and IoT data, for analytics in the competitive Web-based market and data-driven economy (Rouse, 2007). This bottleneck often occurs at the interface between the gateway and wireless devices or sensors. Dynamic schedulers using particle-swarm optimization and better resource-allocation algorithms can also alleviate bottlenecks in such networks (Gubbi, Buyya, Marusic, & Palaniswami, 2013).
Root Causes
With the data explosion projected to reach 44 zettabytes by 2020 (Vizard, 2014), research firms find that 85% of Web data consists of new data types, that data increases tenfold every five years, and that this data likely contains meaningful information (Juniper Research, 2016). Performance bottlenecks thus become a serious threat of communication loss. The root causes of performance bottlenecks relate to the tremendous amount of data generated by humans and machines. Large-scale Web data includes traditional data, user-generated data, and commodity data created and consumed by self-empowered knowledge workers. By Moore's Law, computing hardware capacity doubles roughly every 18 months (Intel, 2015). Mobile computing accesses and produces data pervasively; social networking produces massive user-generated data such as streaming data and texts; cloud computing eases storage limitations; and Internet of Things (IoT) devices and sensors generate and collect data by the trillions. Therefore, one root cause is the massive volume of messages, which creates load imbalance among network entities; that load imbalance in turn creates bottlenecks on the Web. Another root cause is that the huge volume of data generated by humans, machines, sensors, etc. in daily communication activities may flood the networks. In addition, a deadlock occurs when two programming tasks want to access the same Web data sets stored in one node of a hierarchical group; the coordinator of a hierarchical group may become a single point of failure (SPOF) and a potential performance bottleneck in the network (Sakr, & Gaber, 2014).
Proposed solutions
Regarding the trend of performance bottlenecks, the academic and research community has studied Web application performance symptoms and bottleneck identification, and researchers have raced to resolve these bottlenecks. Typical solutions include reconfiguring, upgrading, or replacing the offending device. For servers, upgrading CPU or memory and replacing aging single-CPU servers with dual- or quad-CPU servers may help; at the network level, upgrading switches can work. Proactively monitoring traffic load trends over time and continually implementing improvements can prevent performance bottlenecks (Rouse, 2007).
            A further study shows that networked computers form distributed systems that communicate by passing messages from a source node to destination nodes, for three main reasons: high performance, low cost, and high quality of service (QoS). The proposed solutions for performance bottlenecks in distributed systems over the Web include: (1) a strategy of distributing or partitioning the workload, and (2) effective mapping of partitions.
     1. Strategy of distributing and partitioning the workload
            This strategy distributes and partitions the workload across compute nodes, co-locating communicating nodes to reduce the pressure on the communication network, such as a cloud or mobile network. This also improves performance: it minimizes communication overheads while avoiding computation skew among machines. The technique strives for an effective partitioning of the work in which communicating machines are co-located.
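The co-location idea can be sketched as a greedy heuristic (an illustrative Python sketch, not a specific published algorithm; the graph and function names are hypothetical): each node is placed on the machine that already holds most of its communication partners, with ties broken toward the least-loaded machine.

```python
def partition(edges, num_machines):
    """Greedy sketch: assign each node to the machine that already holds
    most of its communication partners, to cut cross-machine traffic."""
    assignment = {}
    load = [0] * num_machines
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    for node in sorted(neighbors):
        # Score each machine by how many of this node's partners it holds.
        scores = [0] * num_machines
        for nb in neighbors[node]:
            if nb in assignment:
                scores[assignment[nb]] += 1
        # Prefer co-location; break ties toward the least-loaded machine.
        best = max(range(num_machines), key=lambda m: (scores[m], -load[m]))
        assignment[node] = best
        load[best] += 1
    return assignment

# Two communication clusters {a, b, c} and {d, e} land on separate machines.
edges = [("a", "b"), ("b", "c"), ("d", "e")]
assignment = partition(edges, 2)
```

Real systems use far more sophisticated graph partitioners, but the objective is the same: minimize edges that cross machine boundaries while keeping the load balanced.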
     2. Effective mapping of partitions.
            Partitions should be mapped to machines in a way that is fully aware of the underlying network topology; otherwise, a message may traverse a large number of switches before it reaches its destination. This strategy reduces the number of hops in the cluster's rack topology, which decreases network latency and improves overall performance. The technique uses the relative locality of nodes to arrive at a reasonable inference about the rack topology of the cluster.
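A simple hop-count cost model makes the topology argument concrete (an illustrative Python sketch with a hypothetical two-rack layout; the hop counts assume one top-of-rack switch per rack and one core switch between racks):

```python
# Hypothetical rack layout: rack id -> nodes in that rack.
racks = {"rack1": {"n1", "n2"}, "rack2": {"n3"}}
rack_of = {n: r for r, nodes in racks.items() for n in nodes}

def hops(a, b):
    """Rough cost model: 0 hops on the same node, 2 via the rack's
    top-of-rack switch, 4 via the core switch between racks."""
    if a == b:
        return 0
    return 2 if rack_of[a] == rack_of[b] else 4

def place(partner, candidates):
    """Pick the candidate node with the fewest hops to its partner."""
    return min(candidates, key=lambda n: hops(partner, n))

# A partition talking to n1 should land in rack1 rather than rack2.
choice = place("n1", ["n2", "n3"])
```

A topology-aware mapper applies this kind of cost comparison across all partitions, steering communicating partitions into the same rack whenever load permits.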
Conclusion

In summary, this document provided a brief overview of the Web's evolution across five generations. It identified the main sources of large-scale Web data in each Web version and discussed typical places where performance bottlenecks occur, such as load imbalance in the network or the interface between wireless sensors and the gateway. The root causes of the performance bottlenecks were analyzed, and two high-level strategies to alleviate them were proposed.


REFERENCES

Brookshear, G. (2012). Computer science: an overview. Addison-Wesley Longman Publishing Co., Inc.

Gubbi, J., Buyya, R., Marusic, S., & Palaniswami, M. (2013). Internet of Things (IoT): A vision, architectural elements, and future directions. Future Generation Computer Systems, 29(7), 1645-1660.

Intel (2015). 50 years of moore’s law. Retrieved January 15, 2017 from http://www.intel.com/content/www/us/en/silicon-innovations/moores-law-technology.html

Juniper Research (2016). IoT – internet of transformation 2017. Retrieved January 11, 2017 from https://www.juniperresearch.com/document-library/white-papers/iot-~-internet-of-transformation-2017

Patel, K. (2013). Incremental journey for world wide web: introduced with web 1.0 to recent web 5.0 – a survey paper. International Journal of Advanced Research in Computer Science and Software Engineering, 3(10), 410–417.

Rouse, M. (2007). Bottleneck. Retrieved January 4th, 2017 from http://searchenterprisewan.techtarget.com/definition/bottleneck

Sakr, S., & Gaber, M. (Eds.). (2014). Large scale and big data: processing and management. CRC Press.

Thomas, T. (2012). Web applications performance symptoms and bottlenecks identification. Retrieved January 04, 2017 from http://www.agileload.com/agileload/blog/2012/11/27/web-applications-performance-symptoms-and-bottlenecks-identification

Trist, E. L., & Bamforth, K. W. (1951). Some social and psychological consequences of the Longwall method. Human relations, 4(3), 3-38.

Vermesan, O., & Friess, P. (2013). Internet of things: converging technologies for smart environments and integrated ecosystems. Retrieved September 04, 2016 from http://www.internet-of-things-research.eu/pdf/Converging_Technologies_for_Smart_Environments_and_Integrated_Ecosystems_IERC_Book_Open_Access_2013.pdf

Vizard, M. (2014). Digital universe expands at an alarming rate. Retrieved August 30, 2016 from http://www.cioinsight.com/it-strategy/storage/slideshows/digital-universe-expands-at-an-alarming-rate.html