Thursday, February 23, 2017

A Survey Report on RDF Data Query Processing

Introduction
The third generation of the World Wide Web (Web 3.0), which centers on semantic and knowledge content, uses the Resource Description Framework (RDF) as the conceptual foundation and data model for information about Web resources. Because RDF represents data in a machine-readable form, it is widely used in Web 3.0 knowledge management applications. An RDF expression is a set of triples, which can also be viewed as a directed labeled graph. Each triple consists of (1) a subject, (2) a predicate, and (3) an object. Querying RDF triples amounts to performing large join operations, and traditional relational systems handle queries with many star joins poorly. Uniform Resource Identifiers (URIs) are typically used in RDF to identify Web resources uniquely (Brickley & Guha, 2014). SPARQL (SPARQL Protocol and RDF Query Language) is the semantic query language used to retrieve and manipulate RDF data (W3C, 2004). However, querying large-scale RDF data sets raises difficult problems. For instance, evaluating SPARQL queries on large-scale Web data sets requires several joins between subsets of the data, which complicates programming. In addition, traditional single-machine approaches cannot scale up or scale out to keep pace with the growing volume of available RDF data.
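To make the link between triples and joins concrete, the following is a minimal in-memory sketch in Python. It is not how any of the surveyed systems is implemented; the `ex:` resource names, the toy data, and all function names are assumptions made for illustration. Each triple pattern is matched against the triples, and the partial results are joined on their shared variables.

```python
# Toy RDF graph: each triple is (subject, predicate, object).
# The ex: names are hypothetical and exist only for this illustration.
TRIPLES = [
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:alice", "ex:knows", "ex:carol"),
    ("ex:bob", "ex:knows", "ex:carol"),
    ("ex:alice", "ex:age", "30"),
    ("ex:bob", "ex:age", "25"),
]


def is_var(term):
    """SPARQL-style variables start with a question mark."""
    return term.startswith("?")


def match_pattern(pattern, triples):
    """Yield one variable binding (a dict) per triple that matches the pattern."""
    for triple in triples:
        binding = {}
        matched = True
        for p_term, t_term in zip(pattern, triple):
            if is_var(p_term):
                # A variable used twice in one pattern must bind to the same value.
                if binding.get(p_term, t_term) != t_term:
                    matched = False
                    break
                binding[p_term] = t_term
            elif p_term != t_term:
                matched = False
                break
        if matched:
            yield binding


def compatible(a, b):
    """Two bindings are compatible if they agree on every shared variable."""
    return all(b[k] == v for k, v in a.items() if k in b)


def join(left, right):
    """Nested-loop join of two lists of bindings on their shared variables."""
    return [{**l, **r} for l in left for r in right if compatible(l, r)]


def evaluate_bgp(patterns, triples):
    """Evaluate a list of triple patterns by joining them one after another."""
    solutions = [{}]
    for pattern in patterns:
        solutions = join(solutions, list(match_pattern(pattern, triples)))
    return solutions


# Roughly: SELECT * WHERE { ?x ex:knows ?y . ?y ex:age ?age }
QUERY = [("?x", "ex:knows", "?y"), ("?y", "ex:age", "?age")]
RESULTS = evaluate_bgp(QUERY, TRIPLES)
print(RESULTS)
```

Every additional triple pattern adds one more join, which is why queries with many patterns become expensive on large data sets.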
In 2004, Google's MapReduce framework, built on a parallel and distributed algorithm, opened up new ways to improve the efficiency of RDF query processing. Amazon's EC2 (Elastic Compute Cloud), YARS2, and 4store are a few other solutions. This survey report compares and evaluates two promising solutions for improving RDF query performance, (1) PigSPARQL and (2) MAPSIN, on four criteria: main focus, key technical changes, rationale, and pros and cons of each solution.

Main focus of the solutions
     1. PigSPARQL

SPARQL, recommended by the W3C, is the standard query language for RDF data sets. A SPARQL query is built from triple patterns whose subject, predicate, and object may be variables, and on MapReduce it is evaluated as a sequence of iterations made up of map, shuffle, sort, and reduce tasks. The challenge is to join data sets efficiently on either the map side or the reduce side. Reduce-side joins are inefficient for selective joins and require a great deal of network bandwidth. Map-side merge joins are difficult to cascade, and cascading them loses advantages such as avoiding the shuffle (Ghemawat, Gobioff & Leung, 2003). Pig Latin, developed by Yahoo! Research, is an Apache Hadoop-based language for analyzing very large data sets. Apache Pig automatically translates a Pig Latin program into a series of MapReduce jobs. Translating SPARQL to Pig Latin therefore lets SPARQL queries run on a MapReduce cluster, and it benefits from the performance enhancements and support for newer Hadoop versions that come with the further development of Apache Pig, with minimal changes to program code (Blanas, Patel, Ercegovac, Rao, Shekita & Tian, 2010). PigSPARQL is a translation technique that maps complex SPARQL queries to Pig Latin for execution on a MapReduce cluster. Fig 1 illustrates the high-level modular translation process.
Fig 1: A modular translation process
Source: Adapted from Schatzle et al., 2011.
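The cost of a reduce-side join can be seen in a small simulation. The sketch below is plain Python rather than Hadoop or Pig code, and the data, the two hard-coded triple patterns, and the function names are assumptions made for illustration. The map phase tags every matching triple and keys it on the join variable, the shuffle groups records by key, and only the reduce phase discovers which records actually join.

```python
from collections import defaultdict

# Hypothetical toy data: only one ex:knows triple has a join partner.
TRIPLES = [
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:alice", "ex:knows", "ex:carol"),
    ("ex:bob", "ex:age", "25"),
    ("ex:dave", "ex:age", "40"),
]


def map_phase(triples):
    """Tag each triple with the pattern it matches and key it on ?y.

    Pattern 0: ?x ex:knows ?y   (the join key is the object)
    Pattern 1: ?y ex:age ?age   (the join key is the subject)
    """
    for s, p, o in triples:
        if p == "ex:knows":
            yield o, (0, {"?x": s})
        elif p == "ex:age":
            yield s, (1, {"?age": o})


def shuffle_phase(pairs):
    """Group all mapped values by key, as the shuffle and sort step would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """For each key, combine every pattern-0 value with every pattern-1 value."""
    for key, values in groups.items():
        left = [binding for tag, binding in values if tag == 0]
        right = [binding for tag, binding in values if tag == 1]
        for l in left:
            for r in right:
                yield {"?y": key, **l, **r}


MAPPED = list(map_phase(TRIPLES))
JOINED = list(reduce_phase(shuffle_phase(MAPPED)))
print(len(MAPPED), "records shuffled,", len(JOINED), "result(s)")
```

In this toy run all four mapped records pass through the shuffle although only two of them contribute to the single result, which is the kind of overhead that makes reduce-side joins costly for selective queries.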
     2. MAPSIN
Based on HBase's indexing capabilities, MAPSIN (Map-Side Index Nested Loop Join) improves the performance of selective queries by retaining the flexibility of reduce-side joins while gaining the efficiency of map-side joins, without any change to the underlying framework. Its main focus is to use the indexing capabilities of the NoSQL store HBase for scalable joins on the Hadoop MapReduce framework, including multiway joins and one-pattern queries. HBase, a column-oriented (column-family) NoSQL database that integrates well with Hadoop, can store arbitrary RDF graphs. The RDF storage schema defines how RDF data are modeled in HBase. HBase acts as an additional storage layer on top of HDFS that provides random access to data in near real time, which HDFS alone does not offer. MAPSIN computes the join between two triple patterns by merging compatible mappings in a single map phase, transferring only the data that is needed. Fig 2 illustrates a typical RDF graph and SPARQL query.
Fig 2: RDF graph and SPARQL query
Source: Adapted from Schatzle et al., 2012.
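The idea of a map-side index nested loop join can be sketched in a few lines of Python. This is a simplified illustration, not the MAPSIN implementation: a dictionary keyed by subject stands in for an HBase table, and the data and function names are assumptions. For each mapping of the first triple pattern, the join partners are fetched directly from the index, so no shuffle or reduce step is involved.

```python
from collections import defaultdict

# Hypothetical toy data, not taken from the surveyed papers.
TRIPLES = [
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:alice", "ex:knows", "ex:carol"),
    ("ex:bob", "ex:age", "25"),
    ("ex:dave", "ex:age", "40"),
]


def build_subject_index(triples):
    """Index triples by subject; a stand-in for a subject-keyed HBase table."""
    index = defaultdict(list)
    for s, p, o in triples:
        index[s].append((p, o))
    return dict(index)


def map_side_index_join(left_mappings, join_var, predicate, out_var, index):
    """Extend each left mapping with join partners looked up in the index.

    Joins the left mappings with the pattern (join_var, predicate, out_var).
    Only rows whose key occurs in a left mapping are ever read.
    """
    for mapping in left_mappings:
        for p, o in index.get(mapping[join_var], []):
            if p == predicate:
                yield {**mapping, out_var: o}


SUBJECT_INDEX = build_subject_index(TRIPLES)

# Mappings for the first pattern: ?x ex:knows ?y
LEFT = [{"?x": s, "?y": o} for s, p, o in TRIPLES if p == "ex:knows"]

# Join with the second pattern: ?y ex:age ?age
JOINED = list(map_side_index_join(LEFT, "?y", "ex:age", "?age", SUBJECT_INDEX))
print(JOINED)
```

Note that the row for `ex:dave` is never read, because no left mapping refers to it. Reading only the needed rows is what the text above means by transferring only the needed data.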

Main technical changes
The key technical changes in PigSPARQL and MAPSIN are discussed as follows:
     1. PigSPARQL
Given RDF data, the SPARQL query language, the MapReduce model, and the Pig Latin implementation, PigSPARQL translates complex SPARQL queries through a series of intermediate representations: a syntax tree, an algebra tree, a Pig Latin program, and finally MapReduce jobs. A SPARQL query is handled at the algebra level, where a SPARQL algebra expression is interpreted as a tree that is evaluated bottom-up. Query processing time scales linearly with the size of the RDF data, a property of the MapReduce framework (Schatzle, Przyjaciel-Zablocki & Lausen, 2011).
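A bottom-up walk over an algebra tree can be sketched as follows. This is a toy translator written for this report, not PigSPARQL's code, and the statements it emits are only written in the style of Pig Latin (`FILTER ... BY`, `JOIN ... BY`); they are not the output PigSPARQL would generate. The tuple-based tree encoding, the relation name `triples` with columns `s`, `p`, `o`, and the alias naming scheme are all assumptions.

```python
from itertools import count


def translate(node, statements, counter):
    """Translate an algebra tree bottom-up and return the alias of its result.

    Children are translated before their parent, so every statement only
    refers to aliases that earlier statements have already defined.
    """
    kind = node[0]
    if kind == "pattern":
        # Leaf: select the triples with a given predicate.
        _, predicate = node
        alias = f"t{next(counter)}"
        statements.append(f"{alias} = FILTER triples BY p == '{predicate}';")
        return alias
    if kind == "join":
        # Inner node: translate both children first, then join their results.
        _, left, left_col, right, right_col = node
        left_alias = translate(left, statements, counter)
        right_alias = translate(right, statements, counter)
        alias = f"j{next(counter)}"
        statements.append(
            f"{alias} = JOIN {left_alias} BY {left_col}, {right_alias} BY {right_col};"
        )
        return alias
    raise ValueError(f"unknown algebra node: {kind}")


# Algebra tree for: ?x ex:knows ?y . ?y ex:age ?age
# The object of the first pattern is joined with the subject of the second.
TREE = ("join", ("pattern", "ex:knows"), "o", ("pattern", "ex:age"), "s")

STATEMENTS = []
ROOT_ALIAS = translate(TREE, STATEMENTS, count(1))
print("\n".join(STATEMENTS))
```

The two leaf statements are emitted before the join that consumes them, which mirrors the bottom-up evaluation order described above.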
     2. MAPSIN
Using HBase's indexing capabilities, a MAPSIN join computes the join between two triple patterns in a single map phase while transferring a minimum of data. Triple patterns are cascaded in chains, with the mappings computed in one iteration feeding the next MAPSIN join. The performance of MAPSIN joins is tightly correlated with the number of HBase index lookups, so minimizing the number of index lookups is crucial for optimization (Schatzle, Przyjaciel-Zablocki, Dorner, Hornung, & Lausen, 2012).
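The effect of the number of index lookups can be illustrated with a small counting experiment. The sketch below is an assumption-laden toy, not the MAPSIN algorithm itself: `CountingIndex` is a hypothetical stand-in for an HBase table that counts row reads, and the data are made up. It compares a cascade of two-way joins with a variant that reads each row once and matches several patterns sharing the same join variable against it.

```python
# Hypothetical rows keyed by subject; each row lists (predicate, object) pairs.
ROWS = {
    "ex:bob": [("ex:age", "25"), ("ex:city", "ex:berlin")],
    "ex:carol": [("ex:age", "31")],
}


class CountingIndex:
    """Stand-in for an index table that counts how many row lookups occur."""

    def __init__(self, rows):
        self.rows = rows
        self.lookups = 0

    def get(self, subject):
        self.lookups += 1
        return self.rows.get(subject, [])


def cascaded_joins(mappings, join_var, steps, index):
    """One join per pattern: every step looks up the row again for each mapping."""
    for predicate, out_var in steps:
        mappings = [
            {**m, out_var: o}
            for m in mappings
            for p, o in index.get(m[join_var])
            if p == predicate
        ]
    return mappings


def multiway_join(mappings, join_var, steps, index):
    """Read each row once and match all patterns against it."""
    results = []
    for m in mappings:
        row = index.get(m[join_var])  # single lookup per mapping
        partial = [m]
        for predicate, out_var in steps:
            partial = [
                {**pm, out_var: o} for pm in partial for p, o in row if p == predicate
            ]
        results.extend(partial)
    return results


# Mappings of a first pattern ?x ex:knows ?y, followed by two patterns on ?y.
START = [{"?x": "ex:alice", "?y": "ex:bob"}, {"?x": "ex:alice", "?y": "ex:carol"}]
STEPS = [("ex:age", "?age"), ("ex:city", "?city")]

cascade_index = CountingIndex(ROWS)
CASCADE_RESULT = cascaded_joins(START, "?y", STEPS, cascade_index)

multiway_index = CountingIndex(ROWS)
MULTIWAY_RESULT = multiway_join(START, "?y", STEPS, multiway_index)

print(cascade_index.lookups, multiway_index.lookups)
```

Both variants return the same mappings, but in this toy run the cascade performs four lookups and the single-read variant performs two. The gap grows with the number of patterns that share the join variable.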
Rationale for the technical changes
            The rationale for the technical changes in PigSPARQL and MAPSIN is explained below:
     1. PigSPARQL 
Querying RDF data sets at web scale is difficult because SPARQL evaluation requires several joins between subsets of the data, and single-machine techniques cannot scale to large RDF data. The MapReduce framework, with its good scalability, is therefore an attractive basis for SPARQL processing on the Apache Hadoop platform.
For workloads that extract information from a large RDF data set and then transform and load it into a different format, cluster-based parallelism appears to outperform parallel databases. PigSPARQL offers not only RDF query translation but also a scalable implementation of the entire ETL process on a MapReduce cluster. PigSPARQL provides good performance and excellent scalability for complex analytical queries, but it performs poorly on selective queries because it lacks built-in index structures and shuffles data redundantly when computing joins in the reduce phase (Schatzle et al., 2011).
     2. MAPSIN
According to Schatzle et al. (2012), the MAPSIN join takes advantage of the indexing capabilities of the distributed NoSQL store HBase to improve the performance of selective queries. With HBase as a storage layer on top of HDFS, MAPSIN joins can be processed in the map phase, which avoids costly data shuffling. The MAPSIN join algorithm can also cascade joins across MapReduce iterations: consecutive joins need no additional shuffle and reduce phase to pre-process the data. Notably, MAPSIN joins require no changes to the underlying MapReduce framework. The evaluation by Schatzle et al. showed a significant improvement for selective queries over the common reduce-side join.

Analysis of pros and cons
            The performance of PigSPARQL (with SPARQL and Pig Latin) and MAPSIN (with HBase) is analyzed, and their pros and cons are discussed as follows:
     1. PigSPARQL
          a. Pros
            PigSPARQL is a framework that translates RDF queries from SPARQL to Pig Latin for execution on a MapReduce cluster, without code changes or management overhead. It takes advantage of parallel processing of large-scale data sets. PigSPARQL can serve as the transformation part of a scalable implementation of ETL-based applications. The PigSPARQL approach is easier to implement and maintain than a direct mapping to the MapReduce framework (Afrati & Ullman, 2011). Its translation applies several optimization strategies effectively. PigSPARQL's performance and scaling properties are competitive for complex analytical queries.
          b. Cons
            PigSPARQL performs poorly on selective queries. It has no built-in index structures, and in the reduce phase it shuffles data unnecessarily during join computation.
     2. MAPSIN
          a. Pros
            The MAPSIN join takes advantage of the indexing capabilities of the distributed NoSQL database HBase to improve selective queries. It eliminates costly data shuffling and speeds up selective queries compared with the common reduce-side join. HBase can store RDF data in a space-efficient schema with favorable access characteristics. With HBase, join partitions are not shuffled across the network; instead, the relevant join partners are accessed in each iteration. The reduce phase is not used in a MAPSIN join, and no change to the MapReduce framework or the Hadoop platform is needed. Map-side joins are generally more efficient than reduce-side joins because they avoid the shuffle. The combination of HBase and MapReduce allows a sequence of MAPSIN joins to be cascaded without sorting and repartitioning the intermediate output for the next iteration (Dean & Ghemawat, 2008). The multiway optimization reduces the number of MapReduce iterations and HBase requests. Overall, the MAPSIN join approach outperforms the reduce-side join for selective queries, with a significant improvement in total query execution time.
          b. Cons
            MAPSIN joins have some strict preconditions that make them difficult to use in a sequence of joins.

Conclusion
In summary, this survey report presented two widely used approaches, PigSPARQL and MAPSIN, that build on the MapReduce framework and the Hadoop platform to process RDF queries over large-scale data sets. It identified the primary focus of each approach and described it at a high level together with the associated technologies: SPARQL, Pig Latin, HBase, the MapReduce framework, and Hadoop HDFS. It explained the rationale for the technical changes in each solution and analyzed the pros and cons of PigSPARQL and MAPSIN. An open question is how to incorporate MAPSIN joins into PigSPARQL as a complementary hybrid solution that selects the join method dynamically, based on pattern selectivity and statistics gathered at data loading time.

REFERENCES

Afrati, F. & Ullman, J. (2011). Optimizing multiway joins in a map-reduce environment. IEEE Trans. Knowl. Data Eng. 23(9), 1282–1298.
Blanas, S., Patel, J.M., Ercegovac, V., Rao, J., Shekita, E.J. & Tian, Y. (2010). A comparison of join algorithms for log processing in MapReduce. In: SIGMOD.
Brickley, D., & Guha, R. V. (eds). (2014, February 25). RDF Schema 1.1. W3C. Retrieved from http://www.w3.org/TR/2014/REC-rdf-schema-20140225/
Dean, J. & Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters. Communications of the ACM 51(1), 107–113.
Ghemawat, S., Gobioff, H. & Leung, S.T. (2003). The Google file system. In: Proc. SOSP, pp. 29–43.
Sakr, S., & Gaber, M. (Eds.). (2014). Large scale and big data: processing and management. Boca Raton, FL: CRC Press.
Schatzle, A., Przyjaciel-Zablocki, M., Dorner, C., Hornung, T. & Lausen, G. (2012). Cascading map-side joins over HBase for scalable join processing. Retrieved January 29, 2017 from http://ceur-ws.org/Vol-943/SSWS_HPCSW2012_paper5.pdf
Schatzle, A., Przyjaciel-Zablocki, M. & Lausen, G. (2011). PigSPARQL: mapping SPARQL to Pig Latin. Retrieved January 31, 2017 from http://www.csd.uoc.gr/~hy561/papers/storageaccess/largescale/PigSPARQL-%20Mapping%20SPARQL%20to%20Pig%20Latin.pdf
W3C (The World Wide Web Consortium). (2004). Resource description framework: concepts and abstract syntax, W3C recommendation.









