Evan Bolton 1, Arto Bendiken 2
Affiliations: 1 National Center for Biotechnology Information, National Institutes of Health; 2 Datagraph, Inc.
We would like to propose a tutorial on a compact, searchable RDF representation format called HDT (Header, Dictionary, Triples) and on how to serve PubChemRDF HDT files on the Web under the Triple Pattern Fragments (TPF) framework. The tutorial will introduce the basic concepts of HDT and TPF and explain in detail the benefits of using HDT files to store and exchange very large RDF datasets, such as PubChemRDF, which comprises billions of triples. We will demonstrate how to serve the PubChemRDF data stored in HDT files using the TPF framework and how to send SPARQL queries to the PubChemRDF TPF server.
The following topics will be addressed:
- The available toolkits to convert PubChemRDF data into HDT files
- The benefits of the HDT serialization format
- Serving PubChemRDF data in HDT files using Jena Fuseki and the TPF framework
- Current software development status and future directions
Background
HDT serialization
Rather than sharing RDF data in textual serialization formats, which waste bandwidth during download and are expensive to index and search, HDT (Header, Dictionary, Triples) is a binary RDF serialization format optimized for storing data and exchanging it over the network [1]. An HDT file is already indexed and ready for browsing and querying. HDT is a compact data structure that keeps large datasets compressed to save space while still supporting search and browse operations, which makes it an ideal format for storing and sharing RDF datasets on the Web. An HDT file is read-only and enables fast, concurrent querying. The internal compression techniques of HDT allow most of the data (or even the whole dataset) to be kept in main memory, which is several orders of magnitude faster than disk. They also allow many queries to be dispatched per second using multiple threads. Notably, an HDT file can be further compressed using conventional compression techniques, with a compression ratio that depends on the structure of the RDF data.
An HDT-encoded dataset is composed of three logical components (Header, Dictionary, and Triples), carefully designed to address the peculiarities of RDF. The Header holds metadata describing the dataset in plain RDF. It acts as an entry point for the consumer, who can get an initial idea of key properties of the content even before retrieving the whole dataset. The Dictionary is a catalog of all the distinct terms used in the dataset, such as URIs, literals, and blank nodes. A unique identifier (ID) is assigned to each term, so that triples can be represented as tuples of three IDs referencing their respective subject, predicate, and object terms in the dictionary. This is a first step toward compression, since it avoids repeating long terms over and over. Moreover, similar strings are now stored together inside the dictionary, a fact that can be exploited to improve compression even further. Since the RDF triples can now be seen as tuples of three IDs, the Triples component models the graph of relationships among the dataset terms. By exploiting the typical properties of RDF graphs, this information can be represented more efficiently, both to reduce the overall size and to provide efficient search and traversal operations.
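To make the dictionary idea concrete, the following toy Java sketch (not the actual HDT encoding, and using made-up triples purely for illustration) shows how assigning an integer ID to each term lets triples be stored as tuples of three numbers, with each long URI kept only once:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy illustration of dictionary-based ID encoding (not the actual HDT layout). */
public class DictionarySketch {
    private final Map<String, Integer> dictionary = new LinkedHashMap<>();
    private final List<int[]> triples = new ArrayList<>();

    private int idOf(String term) {
        // Assign the next free ID the first time a term is seen.
        return dictionary.computeIfAbsent(term, t -> dictionary.size() + 1);
    }

    public void add(String s, String p, String o) {
        triples.add(new int[] { idOf(s), idOf(p), idOf(o) });
    }

    public static void main(String[] args) {
        DictionarySketch sketch = new DictionarySketch();
        String aspirin = "http://rdf.ncbi.nlm.nih.gov/pubchem/compound/CID2244";
        // Made-up triples for illustration only.
        sketch.add(aspirin, "http://www.w3.org/2000/01/rdf-schema#label", "\"aspirin\"");
        sketch.add(aspirin, "http://example.org/hasDescriptor", "http://example.org/descriptor/1");

        // The long compound URI is stored once; both triples reuse its ID.
        System.out.println("dictionary entries: " + sketch.dictionary.size()); // 5
        for (int[] t : sketch.triples) {
            System.out.println(Arrays.toString(t)); // [1, 2, 3] and [1, 4, 5]
        }
    }
}
```

The real HDT Dictionary and Triples components add sorted, compressed term storage and succinct adjacency structures on top of this basic idea, which is what enables both the small size and the indexed access described above.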
The main strengths of the HDT format are: 1) Compactness: it uses less space to encode the same information, saving storage, communication bandwidth, and transfer time; because it is designed around the peculiarities of RDF, it improves on the ratios of general-purpose compressors such as GZIP and BZIP2. 2) Clean publication and exchange: it keeps metadata, such as provenance and statistics, easily accessible in the Header, and the content separates the terms (Dictionary) from the structure of relationships among them (Triples). 3) On-demand indexed access: it permits fast search operations on parts of the file without having to decompress it as a whole.
In summary, a binary HDT file can be directly loaded into the memory of a computational system and accessed as a data structure, thereby avoiding expensive indexing operations. Thanks to its compression techniques, more data fits in the higher levels of the memory hierarchy, and data in faster memory means faster retrieval operations. HDT has been submitted to the W3C as a member submission.
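As a minimal sketch of this access pattern, assuming the rdfhdt hdt-java library and a hypothetical file name, an indexed HDT file can be memory-mapped and searched by triple pattern without any prior parsing or index-building step:

```java
import org.rdfhdt.hdt.hdt.HDT;
import org.rdfhdt.hdt.hdt.HDTManager;
import org.rdfhdt.hdt.triples.IteratorTripleString;
import org.rdfhdt.hdt.triples.TripleString;

public class HdtSearchSketch {
    public static void main(String[] args) throws Exception {
        // Memory-map the already indexed HDT file (hypothetical file name)
        // instead of parsing a textual serialization and rebuilding indexes.
        HDT hdt = HDTManager.mapIndexedHDT("pubchem-subset.hdt", null);
        try {
            // Search by triple pattern; empty strings act as wildcards.
            IteratorTripleString it = hdt.search(
                    "http://rdf.ncbi.nlm.nih.gov/pubchem/compound/CID2244", "", "");
            while (it.hasNext()) {
                TripleString t = it.next();
                System.out.println(t.getSubject() + " " + t.getPredicate() + " " + t.getObject());
            }
        } finally {
            hdt.close();
        }
    }
}
```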
Triple Pattern Fragments Framework
A Linked Data Fragment (LDF) of a Linked Data dataset is a resource consisting of the triples of that dataset that match a specific selector, together with metadata and hypermedia controls. LDF defines new ways of publishing Linked Data in which the query workload is distributed between clients and servers. A Triple Pattern Fragment (TPF) is a Linked Data Fragment whose selector is a triple pattern, with count metadata and controls to retrieve any other TPF of the same dataset, in particular other fragments to which the matching elements belong. Fragments are paged so that each page contains only part of the data. TPF minimizes server processing while still enabling efficient querying by clients. Data dumps allow full querying on the client side, but all processing happens locally; this is not Web querying, since the data is likely outdated and comes from a single source. Subject pages also require little server effort, but they do not allow efficient querying of all graph patterns: for instance, finding a list of artists is nearly impossible with regular dereferencing or Linked Data querying. Compared to SPARQL results, TPFs are easier to generate because the server effort is bounded, whereas each SPARQL query can demand an unlimited amount of server resources.
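As a sketch of what a client sees, the fragment for a single triple pattern can be fetched as plain RDF over HTTP. The endpoint URL below is hypothetical, and the subject/predicate/object query parameters follow the convention of common Linked Data Fragments server implementations rather than anything PubChem-specific:

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.riot.RDFDataMgr;

public class TpfFragmentSketch {
    public static void main(String[] args) {
        // Hypothetical TPF endpoint; the triple pattern is encoded in the URL.
        String fragmentUrl = "http://example.org/fragments/pubchem"
                + "?subject=http%3A%2F%2Frdf.ncbi.nlm.nih.gov%2Fpubchem%2Fcompound%2FCID2244";

        // A fragment is plain RDF: the matching triples plus count metadata and
        // hypermedia controls, which a TPF client uses to plan full SPARQL queries.
        Model fragment = RDFDataMgr.loadModel(fragmentUrl);
        fragment.write(System.out, "TURTLE");
    }
}
```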
With TPF, we aim to discuss ways to publish Linked Data in addition to SPARQL endpoints, subject pages, and data dumps. In particular, we want to enable clients to query the Web of Data, which is impossible or unreliable today because of the low availability of public SPARQL endpoints. New types of Linked Data Fragments, such as TPF, can vastly improve server availability while still enabling client-side querying. In short, the goal of Linked Data Fragments is to build servers that enable intelligent clients.
PubChemRDF Project
PubChem is an open repository for chemical substance descriptions, biological activities, and biomedical annotations. Since its launch in 2004 as a component of the Molecular Libraries Roadmap Initiatives of the U.S. National Institutes of Health (NIH), PubChem has grown rapidly into a sizeable system and serves as a key chemical information resource for many scientific areas, such as cheminformatics, chemical biology, medicinal chemistry, and drug discovery. PubChem contains the largest corpus of publicly available chemical information. As of July 2015, it held more than 223 million depositor-provided chemical substance descriptions, 91 million unique chemical structures, and one million biological assay descriptions covering about ten thousand protein target sequences. PubChem's data are provided by more than 450 contributors, including university labs, government agencies, pharmaceutical companies, chemical vendors, publishers, and a number of chemical biology resources. The data provided by these contributors are not limited to small molecules, but also include siRNAs, miRNAs, carbohydrates, lipids, peptides, chemically modified macromolecules, and many others.
PubChemRDF gives researchers the ability to use schema-less data systems and so-called RDF triple stores with SPARQL query engines to analyze data available within PubChem. Selected PubChemRDF data files from any subdomain can be downloaded from the PubChem FTP site and imported into an RDF triple/quad store (such as OpenLink Virtuoso), which usually provides a SPARQL query interface. Alternatively, PubChemRDF data can be loaded into RDF-aware graph databases such as Neo4j. The RDF graph representation also enables the application of graph-theoretic algorithms for mining semantic associations between chemical and biological entities. In addition to bulk download via FTP, PubChemRDF provides programmatic data access through a RESTful interface. Beyond dereferencing URIs, the PubChemRDF REST interface provides simple SPARQL-like query capabilities for grouping and filtering relevant resources, as well as string and substring search functions.
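As a small illustration of URI dereferencing, and assuming the resource URI resolves to an RDF description via content negotiation, the description of aspirin (compound CID 2244) can be retrieved with Apache Jena as follows:

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.StmtIterator;
import org.apache.jena.riot.RDFDataMgr;

public class PubChemDereferenceSketch {
    public static void main(String[] args) {
        // PubChemRDF resource URI for aspirin (compound CID 2244).
        String uri = "http://rdf.ncbi.nlm.nih.gov/pubchem/compound/CID2244";

        // Jena negotiates an RDF serialization and parses the response.
        Model model = RDFDataMgr.loadModel(uri);

        // Print the predicates used in the returned description.
        StmtIterator it = model.listStatements();
        while (it.hasNext()) {
            System.out.println(it.nextStatement().getPredicate().getURI());
        }
    }
}
```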
Tutorial at SWAT4LS
We propose a tutorial at SWAT4LS 2016 in which we will first give a general overview of HDT. In particular, we will go through the basic concepts and demonstrate the benefits of storing and exchanging RDF data using HDT. We will discuss the currently available toolkits for converting billions of RDF triples into an HDT file, as well as how to serve the HDT file on the Web using Jena Fuseki and the TPF framework. Once the audience is familiar with HDT and the TPF framework, the tutorial will continue with a couple of SPARQL queries that address complicated biomedical questions against PubChemRDF data. How to send SPARQL queries and retrieve the results will be demonstrated as well.
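As a hedged sketch of this query step, assuming a Fuseki instance backed by PubChemRDF HDT files at a hypothetical endpoint URL (and an illustrative, not authoritative, choice of predicate), a SPARQL SELECT query could be sent with Apache Jena as follows:

```java
import org.apache.jena.query.Query;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QueryFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;

public class PubChemSparqlSketch {
    public static void main(String[] args) {
        // Hypothetical SPARQL endpoint backed by PubChemRDF HDT files.
        String endpoint = "http://example.org/pubchem/sparql";

        // Illustrative query: list a few compounds and their attached attributes.
        // The sio:SIO_000008 ("has attribute") predicate is an assumption here.
        String queryString =
                "PREFIX sio: <http://semanticscience.org/resource/>\n"
              + "SELECT ?compound ?attribute WHERE {\n"
              + "  ?compound sio:SIO_000008 ?attribute .\n"
              + "} LIMIT 10";

        Query query = QueryFactory.create(queryString);
        QueryExecution qexec = QueryExecutionFactory.sparqlService(endpoint, query);
        try {
            ResultSet results = qexec.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.next();
                System.out.println(row.get("compound") + "  " + row.get("attribute"));
            }
        } finally {
            qexec.close();
        }
    }
}
```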
Audience
The tutorial is aimed at:
- Semantic Web data scientists who build links and applications upon RDF-based life science data;
- Data publishers who are interested in exchanging their large RDF datasets using a compact data representation and serving the data with high availability at low cost;
- Research scientists interested in using the linked data in PubChemRDF for data analysis and complicated queries.