System Π: A Native RDF Repository Based on the Hypergraph Representation for RDF Data Model

吴刚; 李涓子; 胡建强; 王克宏

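Abstract: The Resource Description Framework (RDF) sits at the data interchange layer of the Semantic Web architecture. To manage the growing volume of RDF data, an RDF repository must provide not only the necessary scalability and execution efficiency but also sufficient inference capability. Existing RDF repositories have made progress toward these goals, yet there is still considerable room to improve overall performance.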
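This paper proposes a native RDF repository, System Π, that strikes a better balance among scalability, query efficiency, and inference capability. System Π uses the hypergraph representation of RDF as the data model for persistent storage, which avoids the data model transformation overhead otherwise incurred when accessing RDF data. On top of this native storage scheme, we design a set of semantic query processing techniques.
First, we build several index structures to speed up RDF data access: a Vertex Value Index, the PLSD Index (based on a prime-number labeling scheme and used for transitive closure computation), and three Triple Indices; the most basic access method, Hypergraph Traversal, is also supported. Second, under the pD* semantics we propose a hybrid inference strategy that combines backward chaining and forward chaining and supports OWL-Lite inference at relatively low computational complexity. Finally, we extend the SPARQL algebra with three new operators, the Transitive Operator, the Symmetric Operator, and the Inverse Operator, so that inference semantics can be expressed explicitly and directly in the logical query plan. In addition, the work adopts practical implementation techniques such as MD5 hashing of URIs and caching of schema data (Schema Cache).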
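Extensive performance experiments were conducted on the LUBM benchmark and on the real-world data set SwetoDblp (a large-scale version of DBLP in RDF format). They examine data loading time, repository size, query response time, query completeness and soundness, and a combined metric, and they compare the effect of URI MD5 hashing and schema data caching on performance. The results show that System Π achieves a better combined metric value than the other systems compared.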
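As future work, we plan to add support for RDF data update operations, integrate a SPARQL parser, study cost-based query optimization, and compare System Π with more RDF repositories on more benchmarks, in order to further improve its design and implementation.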
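As a prototype RDF repository, System Π is a good platform for putting RDF data management theory and methods into practice. It can serve as the underlying RDF data management module for a variety of Semantic Web applications. A typical scenario is integration with a Semantic Web search engine: System Π stores the RDF documents gathered by crawlers and offers structured query support to the user-facing query module through an API.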
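
The prime-number labeling idea behind the PLSD Index can be illustrated with a small sketch. The Python below is not System Π's implementation: it assumes the textbook form of prime labeling on a DAG, in which every vertex receives its own prime and its label is that prime multiplied by the least common multiple of its parents' labels, so that a transitive-closure lookup reduces to a divisibility test. The function names (`first_primes`, `label_dag`, `is_subsumed_by`) and the sample class hierarchy are hypothetical.

```python
from math import gcd


def first_primes(n):
    """Return the first n primes by trial division (fine for small n)."""
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def label_dag(parents):
    """Label every node of an acyclic hierarchy.

    parents maps each node to the list of its direct super-nodes.
    A node's label is its own prime times the LCM of its parents' labels,
    so the label's prime factors are exactly the node and all its ancestors.
    """
    nodes = sorted(parents)
    prime_of = dict(zip(nodes, first_primes(len(nodes))))
    labels = {}

    def label(node):
        if node not in labels:
            inherited = 1
            for parent in parents[node]:
                parent_label = label(parent)
                inherited = inherited * parent_label // gcd(inherited, parent_label)
            labels[node] = prime_of[node] * inherited
        return labels[node]

    for node in nodes:
        label(node)
    return labels


def is_subsumed_by(labels, sub, sup):
    """True if sup equals sub or is reachable from sub via parent edges."""
    return labels[sub] % labels[sup] == 0


# Hypothetical rdfs:subClassOf hierarchy with one multiple-inheritance node.
hierarchy = {
    "Person": [],
    "Course": [],
    "Student": ["Person"],
    "Employee": ["Person"],
    "TeachingAssistant": ["Student", "Employee"],
}
labels = label_dag(hierarchy)
print(is_subsumed_by(labels, "TeachingAssistant", "Person"))
print(is_subsumed_by(labels, "Student", "Employee"))
```

The point of the design is that the closure is never materialized: one integer per vertex answers any ancestor query. The labels grow quickly with depth and fan-in, which this sketch does not address.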

Abstract: RDF is the data interchange layer for the Semantic Web. In order to manage the increasing amount of RDF data, an RDF repository should provide not only the necessary scalability and efficiency, but also sufficient inference capabilities. Though existing RDF repositories have made progress towards these goals, there is still ample room for improving the overall performance. In this paper, we propose a native RDF repository, System Π, to pursue a better tradeoff among system scalability, query efficiency, and inference capabilities. System Π takes a hypergraph representation for RDF as the data model for its persistent storage, which effectively avoids the costs of data model transformation when accessing RDF data. Based on this native storage scheme, a set of efficient semantic query processing techniques is designed. First, several indices are built to accelerate RDF data access, including a value index, a labeling scheme for transitive closure computation, and three triple indices. Second, we propose a hybrid inference strategy under the pD* semantics to support inference for OWL-Lite with a relatively low computational complexity. Finally, we extend the SPARQL algebra to explicitly express inference semantics in the logical query plan by defining some new algebra operators. In addition, MD5 hash values of URIs and a schema-level cache are introduced as practical implementation techniques. The results of performance evaluation on the LUBM benchmark and a real data set show that System Π has a better combined metric value than other comparable systems.
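
To make the combination of MD5-hashed URIs and three triple indices concrete, here is a minimal in-memory sketch. It rests on assumptions that the abstract does not confirm: the three orderings chosen here (subject-first, predicate-first, object-first) are one common choice rather than the paper's documented layout, and the class `TinyTripleStore` with its methods is hypothetical. The only library call relied on is `hashlib.md5`, whose digest is always 16 bytes.

```python
import hashlib
from collections import defaultdict


def uri_key(uri: str) -> bytes:
    """Map a URI of any length to a fixed-size 16-byte MD5 digest."""
    return hashlib.md5(uri.encode("utf-8")).digest()


class TinyTripleStore:
    """Toy store: hashed terms plus three orderings of each triple (assumed)."""

    def __init__(self):
        self.terms = {}               # digest -> original URI, for decoding results
        self.spo = defaultdict(set)   # subject   -> {(predicate, object)}
        self.pos = defaultdict(set)   # predicate -> {(object, subject)}
        self.osp = defaultdict(set)   # object    -> {(subject, predicate)}

    def _intern(self, term):
        key = uri_key(term)
        self.terms[key] = term
        return key

    def add(self, s, p, o):
        ks, kp, ko = self._intern(s), self._intern(p), self._intern(o)
        self.spo[ks].add((kp, ko))
        self.pos[kp].add((ko, ks))
        self.osp[ko].add((ks, kp))

    def objects(self, s, p):
        """Answer (s, p, ?o) from the subject-first index."""
        kp = uri_key(p)
        return sorted(self.terms[o] for q, o in self.spo.get(uri_key(s), ()) if q == kp)

    def subjects(self, p, o):
        """Answer (?s, p, o) from the predicate-first index."""
        ko = uri_key(o)
        return sorted(self.terms[s] for x, s in self.pos.get(uri_key(p), ()) if x == ko)

    def predicates(self, s, o):
        """Answer (s, ?p, o) from the object-first index."""
        ks = uri_key(s)
        return sorted(self.terms[p] for x, p in self.osp.get(uri_key(o), ()) if x == ks)


store = TinyTripleStore()
store.add("http://example.org/alice", "http://example.org/advisor", "http://example.org/bob")
store.add("http://example.org/carol", "http://example.org/advisor", "http://example.org/bob")
print(store.subjects("http://example.org/advisor", "http://example.org/bob"))
```

Hashing gives every index entry the same key width regardless of URI length, at the cost of keeping a digest-to-URI table for decoding query results.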


System Π: A Native RDF Repository Based on the Hypergraph Representation for RDF Data Model