A Survey Report on RDF Data Query Processing
Introduction
The third generation of the World Wide Web (Web 3.0), which emphasizes semantic relationships and knowledge content, uses the Resource Description Framework (RDF) as the conceptual foundation and data model for information in Web resources. The RDF model, which represents data in a machine-readable form, is widely used in many Web 3.0 knowledge-management applications. An RDF expression is structured as a set of triples, equivalently a directed labeled RDF graph, and each triple consists of (1) a subject, (2) a predicate, and (3) an object. Querying RDF triples amounts to performing large join operations, and traditional relational systems cannot handle a large number of star joins properly. A Uniform Resource Identifier (URI) is usually used in RDF to identify a unique Web resource (Brickley & Guha, 2014). SPARQL (SPARQL Protocol and RDF Query Language) is a semantic query language that can retrieve and manipulate RDF data (W3C, 2004). However, querying large-scale RDF data sets raises difficult problems. For instance, computing SPARQL queries over large-scale Web data sets requires several joins between subsets of the data, which is challenging to program. Also, traditional single-machine approaches cannot scale up or scale out to keep pace with the growing volume of available RDF data.
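As a small illustration of the star-join nature of RDF querying, the following sketch (plain Java with made-up example data, not tied to any of the surveyed systems) evaluates a two-pattern star query by filtering the triple set once per pattern and joining the intermediate results on their shared subject variable.

// A minimal sketch: RDF triples as (subject, predicate, object) records and
// the star pattern ?x ex:name ?n . ?x ex:knows ?y evaluated as a join on ?x.
import java.util.List;

public class StarJoinSketch {
    // One RDF triple.
    record Triple(String s, String p, String o) {}

    public static void main(String[] args) {
        List<Triple> triples = List.of(
            new Triple("ex:alice", "ex:name",  "\"Alice\""),
            new Triple("ex:alice", "ex:knows", "ex:bob"),
            new Triple("ex:bob",   "ex:name",  "\"Bob\""));

        // Each triple pattern selects a subset of the triples; the subsets
        // are then joined on the shared variable ?x (the subject).
        for (Triple t1 : triples) {
            if (!t1.p().equals("ex:name")) continue;      // pattern ?x ex:name ?n
            for (Triple t2 : triples) {
                if (t2.p().equals("ex:knows")             // pattern ?x ex:knows ?y
                        && t2.s().equals(t1.s())) {       // join on ?x
                    System.out.println(t1.s() + " (" + t1.o() + ") knows " + t2.o());
                }
            }
        }
    }
}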
In 2004, Google’s MapReduce framework, with its parallel and distributed processing model, opened up new ways to improve the efficiency of RDF query processing. Amazon’s EC2 (Elastic Compute Cloud), YARS2, and 4store are a few example solutions. In this survey report, two promising solutions for improving RDF query performance, (1) PigSPARQL and (2) MAPSIN, are compared and evaluated on four criteria: main focus, key technical changes, rationale, and an analysis of the pros and cons of each solution.
Main focus of the solutions
1. PigSPARQL
SPARQL, recommended by the W3C, is the standard query language for RDF datasets. A SPARQL query is built from triple patterns whose subject, predicate, and object may be variables, and evaluating such patterns on MapReduce requires a sequence of iterations of mapping, shuffling, sorting, and reducing tasks. The challenge is joining the data sets properly, either map-side or reduce-side, when evaluating a SPARQL query over an RDF graph. Reduce-side join computation is inefficient for selective joins and requires a great deal of network bandwidth, while map-side merge joins are difficult to cascade, so some advantages, such as avoiding the shuffle phase, are lost (Ghemawat, Gobioff & Leung, 2003). Pig Latin, developed by Yahoo! Research, is an Apache Hadoop-based language for analyzing very large data sets. Apache Pig, a high-priority Hadoop project, automatically translates a Pig Latin program into MapReduce jobs. Translating SPARQL to Pig Latin is therefore the main focus: it allows SPARQL queries to be processed on a MapReduce cluster, and performance improvements and support for newer Hadoop versions gained from the ongoing development of Apache Pig carry over with minimal programming changes (Blanas, Patel, Ercegovac, Rao, Shekita & Tian, 2010). PigSPARQL is a query translation technique that translates complex SPARQL queries into Pig Latin programs executed on a MapReduce cluster. Fig 1 illustrates the high-level modular translation process.
Fig 1: A modular translation process (Source: Adapted from Schatzle et al., 2011)
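To make the translation idea concrete, the sketch below shows, purely as an illustration and not as PigSPARQL’s actual output, the kind of Pig Latin program that a simple basic graph pattern could be translated into, submitted to a Hadoop cluster through Apache Pig’s Java API. The input path rdf/triples.nt and the whitespace-separated triple format are assumptions made for this example.

// Illustrative only: a hand-written Pig Latin program for the SPARQL pattern
// ?x <knows> ?y . ?y <name> ?name, run via Apache Pig's PigServer API.
import java.io.IOException;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class RunTranslatedQuery {
    public static void main(String[] args) throws IOException {
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // Load all triples as (subject, predicate, object).
        pig.registerQuery("triples = LOAD 'rdf/triples.nt' USING PigStorage(' ') "
                        + "AS (s:chararray, p:chararray, o:chararray);");

        // One FILTER per triple pattern, then a join on the shared variable ?y.
        pig.registerQuery("knows = FILTER triples BY p == '<http://example.org/knows>';");
        pig.registerQuery("names = FILTER triples BY p == '<http://example.org/name>';");
        pig.registerQuery("j = JOIN knows BY o, names BY s;");
        pig.registerQuery("result = FOREACH j GENERATE knows::s AS x, names::o AS name;");

        // Apache Pig compiles these statements into one or more MapReduce jobs.
        pig.store("result", "output/query1");
    }
}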
2. MAPSIN
Based on HBase’s indexing capabilities, MAPSIN (Map-Side Index Nested Loop Join) improves the performance of selective queries by keeping the flexibility of reduce-side joins while gaining the effectiveness of map-side joins, without any change to the framework. Its main focus is to perform MAPSIN joins with the indexing capabilities of the NoSQL store HBase, providing scalable joins on the Hadoop MapReduce framework for multiway joins and single-pattern queries. HBase, a NoSQL column-oriented (column-family) database that integrates well with Hadoop, can store arbitrary RDF graphs. An RDF storage schema defines how RDF data is modeled and laid out in the store. HBase adds an extra storage layer on top of HDFS that supports near real-time random access, an ability HDFS itself lacks. MAPSIN computes the join between two triple patterns by merging compatible mappings in a single map phase, transferring only the data that is actually needed. Fig 2 illustrates a typical RDF graph and SPARQL query.
Fig 2: RDF graph and SPARQL query (Source: Adapted from Schatzle et al., 2012)
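To illustrate the role HBase plays as an indexed storage layer, the sketch below evaluates a single triple pattern with one HBase index lookup. The table name rdf_spo, the use of the subject as row key, and the 'po' column family are assumptions made for this example and are not the storage schema of Schatzle et al. (2012).

// Illustrative only: evaluate the triple pattern (ex:alice, ?p, ?o) with a
// single Get on an HBase table assumed to be keyed by subject.
import java.io.IOException;
import java.util.Map;
import java.util.NavigableMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TriplePatternLookup {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("rdf_spo"))) {
            // One index lookup on the subject row key: near real-time random
            // access that plain HDFS files cannot provide.
            Get get = new Get(Bytes.toBytes("ex:alice"));
            get.addFamily(Bytes.toBytes("po"));
            Result result = table.get(get);
            NavigableMap<byte[], byte[]> po = result.getFamilyMap(Bytes.toBytes("po"));
            if (po != null) {
                for (Map.Entry<byte[], byte[]> e : po.entrySet()) {
                    // Column qualifier = predicate, cell value = object.
                    System.out.println("ex:alice " + Bytes.toString(e.getKey())
                                       + " " + Bytes.toString(e.getValue()));
                }
            }
        }
    }
}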
Main technical changes
The key technical changes in PigSPARQL and MAPSIN are discussed as follows:
1. PigSPARQL
Starting from RDF data, the SPARQL query language, the MapReduce model, and a Pig Latin implementation, PigSPARQL translates a complex SPARQL query through a series of representations, from a syntax tree to an algebra tree to a Pig Latin program, and finally to MapReduce jobs. Notice that a SPARQL query is handled at the algebra level, and the SPARQL algebra expression is interpreted as a tree that is evaluated bottom-up. Query processing time scales linearly with the size of the RDF data, a characteristic of the MapReduce framework (Schatzle, Przyjaciel-Zablocki & Lausen, 2011).
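As a rough illustration of these stages, the sketch below uses Apache Jena’s ARQ library to obtain the syntax tree and the SPARQL algebra tree; the final step to Pig Latin is only a hypothetical placeholder (translateToPigLatin), since PigSPARQL’s own translator is not reproduced here.

// Illustrative pipeline: SPARQL string -> syntax tree -> algebra tree ->
// (hypothetical) Pig Latin program. Jena's QueryFactory and Algebra are real
// APIs; translateToPigLatin stands in for PigSPARQL's translator.
import org.apache.jena.query.Query;
import org.apache.jena.query.QueryFactory;
import org.apache.jena.sparql.algebra.Algebra;
import org.apache.jena.sparql.algebra.Op;

public class SparqlToPigLatinSketch {
    public static void main(String[] args) {
        String sparql =
            "PREFIX ex: <http://example.org/> " +
            "SELECT ?name WHERE { ?x ex:knows ?y . ?y ex:name ?name }";

        Query query  = QueryFactory.create(sparql);   // parse: syntax tree
        Op algebra   = Algebra.compile(query);        // SPARQL algebra tree
        Op optimized = Algebra.optimize(algebra);     // algebra-level rewriting

        String pigLatin = translateToPigLatin(optimized);
        System.out.println(pigLatin);
    }

    // Placeholder: a real translator walks the algebra tree bottom-up and
    // emits LOAD / FILTER / JOIN / FOREACH ... GENERATE statements per operator.
    static String translateToPigLatin(Op op) {
        return "-- Pig Latin program for algebra expression:\n" + op;
    }
}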
2. MAPSIN
Using HBase's indexing capabilities, the MAPSIN join computes the join between two triple patterns in a single map phase while transferring a minimum of data. Longer chains of triple patterns are handled by cascading MAPSIN joins, with each iteration extending the set of compatible mappings. The performance of a MAPSIN join is tightly correlated with the number of HBase index lookups, so minimizing the number of lookups is the crucial optimization (Schatzle, Przyjaciel-Zablocki, Dorner, Hornung, & Lausen, 2012).
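The sketch below illustrates this map-side index nested loop idea inside a Hadoop Mapper. It assumes an HBase table named rdf_spo keyed by subject with a 'po' column family, a first pattern ?person <knows> ?friend whose bindings arrive as input lines, and a fixed second pattern ?friend <name> ?name; it is an illustration of the technique, not the authors’ implementation.

// Illustrative MAPSIN-style mapper: for every binding of the first pattern,
// look up compatible bindings for the second pattern directly in HBase, so
// the join finishes in the map phase with no shuffle or reduce.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapsinStyleJoinMapper
        extends Mapper<LongWritable, Text, Text, Text> {

    private static final byte[] FAMILY = Bytes.toBytes("po");
    private static final byte[] NAME   = Bytes.toBytes("<http://example.org/name>");

    private Connection connection;
    private Table table;

    @Override
    protected void setup(Context context) throws IOException {
        Configuration conf = HBaseConfiguration.create(context.getConfiguration());
        connection = ConnectionFactory.createConnection(conf);
        table = connection.getTable(TableName.valueOf("rdf_spo"));
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // One binding of the first pattern per input line, e.g. "ex:alice ex:bob"
        // for (?person <knows> ?friend).
        String[] binding = value.toString().split("\\s+");
        String person = binding[0];
        String friend = binding[1];

        // Index lookup for the second pattern (?friend <name> ?name): a single
        // HBase Get on the shared join variable ?friend.
        Get get = new Get(Bytes.toBytes(friend));
        get.addColumn(FAMILY, NAME);
        Result result = table.get(get);
        byte[] name = result.getValue(FAMILY, NAME);
        if (name != null) {
            // Emit the merged, compatible mapping; no reduce phase required.
            context.write(new Text(person + " " + friend),
                          new Text(Bytes.toString(name)));
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        table.close();
        connection.close();
    }
}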
Rationale for the technical changes
The rationale for the technical changes in PigSPARQL and MAPSIN is explained below:
1. PigSPARQL
Querying RDF datasets at web scale is difficult because SPARQL evaluation requires several joins between subsets of the data, and single-machine techniques cannot scale to large RDF collections. The MapReduce framework, with its good scalability properties, therefore becomes attractive for SPARQL processing on the Apache Hadoop platform.
For workloads that extract information from a large RDF dataset and then transform and load the extracted data into a different format, cluster-based parallelism appears to outperform parallel databases. PigSPARQL offers not only the RDF query translation itself but also a scalable implementation of the entire ETL process on a MapReduce cluster. PigSPARQL provides good performance and excellent scalability for complex analytical queries, but it suffers from poor performance on selective queries because it lacks adequate built-in index structures and shuffles redundant data during join computation in the reduce phase (Schatzle et al., 2011).
2. MAPSIN
According to Schatzle et al. (2012), the MAPSIN join takes advantage of the indexing capabilities of the distributed NoSQL store HBase to increase the performance of selective queries. With HBase as a storage layer on top of HDFS, MAPSIN can process joins in the map phase and avoid costly data shuffling. The MAPSIN join algorithm can cascade joins across MapReduce iterations; consecutive joins need no additional shuffle and reduce phase to pre-process the data. Notably, MAPSIN joins require no changes to the underlying MapReduce framework. The authors’ evaluation indicated a significant improvement for selective queries over the common reduce-side join.
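As a rough illustration of how such a cascade can be driven, the sketch below chains map-only Hadoop jobs (zero reduce tasks), so each consecutive join reads the previous iteration’s bindings and performs its index lookups in the map phase. The paths and the MapsinStyleJoinMapper class (from the earlier sketch) are illustrative assumptions, not part of the original system.

// Illustrative driver: one map-only job per additional triple pattern, with
// no shuffle, sort, or reduce between iterations.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CascadedMapOnlyJoins {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String input = "bindings/iter0";               // bindings of the first pattern

        for (int i = 1; i <= 2; i++) {
            Job job = Job.getInstance(conf, "mapsin-join-" + i);
            job.setJarByClass(CascadedMapOnlyJoins.class);
            job.setMapperClass(MapsinStyleJoinMapper.class); // see the earlier sketch
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            job.setNumReduceTasks(0);                   // map-only: no shuffle, no reduce
            FileInputFormat.addInputPath(job, new Path(input));
            String output = "bindings/iter" + i;
            FileOutputFormat.setOutputPath(job, new Path(output));
            if (!job.waitForCompletion(true)) System.exit(1);
            input = output;                             // feed the next iteration
        }
    }
}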
Analysis of pros and cons
The performance of PigSPARQL (with SPARQL and Pig Latin) and MAPSIN (with HBase) is analyzed here. Their pros and cons are discussed as follows:
1. PigSPARQL
a. Pros
PigSPARQL is a framework for translating RDF data queries from SPARQL to Pig Latin that runs on an existing MapReduce cluster without code changes to the cluster or additional management overhead. It takes advantage of parallel processing of large-scale datasets. PigSPARQL provides the transformation step as a scalable implementation for ETL-based applications. The PigSPARQL approach is easier to implement and maintain than a direct mapping onto the MapReduce framework (Afrati & Ullman, 2011). PigSPARQL’s translation applies several optimization strategies effectively, and its performance and scaling properties are competitive for complex analytical queries.
b. Cons
PigSPARQL performs poorly on selective queries: it has no built-in index structures, and its join computation shuffles unnecessary data in the reduce phase.
2. MAPSIN
a. Pros
The MAPSIN join takes advantage of the indexing capabilities of the distributed NoSQL database HBase to improve selective queries. It eliminates costly data shuffling and speeds up selective queries compared with the common reduce-side join. HBase can store RDF data in a space-efficient schema with favorable access characteristics. With HBase, join partitions are not shuffled across the network; the relevant join partners can be accessed directly in each iteration. The reduce phase is not used in the MAPSIN join, and no change to the MapReduce framework or Hadoop platform is required. Map-side joins are much more efficient than reduce-side joins. The combination of HBase and MapReduce allows a sequence of MAPSIN joins to be cascaded without sorting and repartitioning the intermediate output for the next iteration (Dean & Ghemawat, 2008). The multiway join optimization reduces the number of MapReduce iterations and HBase requests. Overall, the MAPSIN join approach outperforms the reduce-side join for selective queries, improving total query execution times significantly.
b. Cons
MAPSIN joins impose strict preconditions that make them difficult to apply within an arbitrary sequence of joins.
Conclusion
In summary, this survey report examined two widely used approaches, PigSPARQL and MAPSIN, that build on the MapReduce framework and the Hadoop platform to process RDF queries over large-scale data sets. The report identified the primary focus of each approach and described it at a high level in terms of SPARQL, Pig Latin, HBase, the MapReduce framework, and Hadoop HDFS. The rationale for the technical changes in each solution was provided, along with an analysis of the pros and cons of PigSPARQL and MAPSIN. An open question is how to incorporate MAPSIN joins into PigSPARQL as a complementary hybrid solution that chooses the join method dynamically based on pattern selectivity and statistics gathered at data loading time.
REFERENCES
Afrati, F. & Ullman, J. (2011). Optimizing multiway joins in a map-reduce environment. IEEE Transactions on Knowledge and Data Engineering, 23(9), 1282–1298.
Blanas, S., Patel, J. M., Ercegovac, V., Rao, J., Shekita, E. J. & Tian, Y. (2010). A comparison of join algorithms for log processing in MapReduce. In: Proc. SIGMOD.
Brickley, D. & Guha, R. V. (Eds.). (2014, February 25). RDF Schema 1.1. W3C. Retrieved from http://www.w3.org/TR/2014/REC-rdf-schema-20140225/
Dean, J. & Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1), 107–113.
Ghemawat, S., Gobioff, H. & Leung, S. T. (2003). The Google file system. In: Proc. SOSP, pp. 29–43.
Sakr, S. & Gaber, M. (Eds.). (2014). Large scale and big data: Processing and management. Boca Raton, FL: CRC Press.
Schatzle, A., Przyjaciel-Zablocki, M., Dorner, C., Hornung, T. & Lausen, G. (2012). Cascading map-side joins over HBase for scalable join processing. Retrieved January 29, 2017 from http://ceur-ws.org/Vol-943/SSWS_HPCSW2012_paper5.pdf
Schatzle, A., Przyjaciel-Zablocki, M. & Lausen, G. (2011). PigSPARQL: Mapping SPARQL to Pig Latin. Retrieved January 31, 2017 from http://www.csd.uoc.gr/~hy561/papers/storageaccess/largescale/PigSPARQL-%20Mapping%20SPARQL%20to%20Pig%20Latin.pdf
W3C (The World Wide Web Consortium). (2004). Resource Description Framework (RDF): Concepts and abstract syntax. W3C Recommendation.