This document proposes an RDF vocabulary to describe indexes which are intended to be published on the Web, to facilitate the search of data by Semantic Web agents. Some indexing and index-querying techniques are also introduced.

This document is the result of a collaboration between INRIA and Startin'Blox on a research project about indexing in a Solid ecosystem.

Introduction

The Semantic Web vision consists of publishing machine-readable documents on the Web so that machines, like humans, can browse the Web, following links from one document to another, in order to gather information and answer complex questions.

However, when a lot of data is available, gathering information and answering complex questions becomes very inefficient without indexes. Indexes are made of metadata that makes it possible to find data more easily and more quickly. Indexing is a widely used mechanism in all kinds of software, especially in databases.

This document proposes an RDF vocabulary to describe one kind of such indexes, intended to be published on the Web to facilitate the search of data by Semantic Web agents. It also presents some indexing and index-querying techniques.

The need for a standard indexing vocabulary appeared in the context of the Solid protocol, where several applications interact in the same way with the same data to achieve interoperability. Indeed, as the data is decoupled from the applications, these applications must agree on a client-to-client protocol to work together. If they want to be able to use the same indexes, these indexes must be standardized.

Terminology

Index A data structure serving as a summary of a dataset, used to quickly locate data in the dataset without having to scan it exhaustively.

Server-side application TBD.

Client-to-client protocol TBD.

Namespaces

Prefix Namespace Description
idx https://ns.inria.fr/idx/terms# Indexing ontology
sh http://www.w3.org/ns/shacl# [[SHACL]]
rdf http://www.w3.org/1999/02/22-rdf-syntax-ns# [[RDF]]
foaf http://xmlns.com/foaf/0.1/
schema http://schema.org/

Ontology description

An index is an RDF document [[RDF11-CONCEPTS]] of type idx:Index. It contains entries of type idx:IndexEntry, each one being linked to a shape with the idx:hasShape predicate.

An entry MUST refer to a resource matching the shape or to another index referring to such resources, with predicate idx:hasTarget or idx:hasSubIndex respectively.
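
The constraint above can be illustrated with a minimal Python sketch. The `IndexEntry` class, the `is_valid_entry` helper and the example IRIs are hypothetical names chosen for illustration; only the idx: terms come from this document. The sketch assumes that the requirement is satisfied when an entry carries at least one of the two references.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IndexEntry:
    """Illustrative in-memory view of an idx:IndexEntry (not a normative model)."""
    shape: str                       # IRI of the shape (idx:hasShape)
    target: Optional[str] = None     # IRI of a matching resource (idx:hasTarget)
    sub_index: Optional[str] = None  # IRI of another index (idx:hasSubIndex)


def is_valid_entry(entry: IndexEntry) -> bool:
    """Check that the entry has a shape and refers to a resource or a sub-index."""
    has_reference = entry.target is not None or entry.sub_index is not None
    return bool(entry.shape) and has_reference


# Hypothetical entries: one pointing at a resource, one at a sub-index, one at nothing.
direct = IndexEntry(shape="https://example.org/shapes#Paris",
                    target="https://alice.example/profile#me")
nested = IndexEntry(shape="https://example.org/shapes#Paris",
                    sub_index="https://example.org/indexes/paris")
dangling = IndexEntry(shape="https://example.org/shapes#Paris")

for entry in (direct, nested, dangling):
    print(is_valid_entry(entry))
```

A real implementation would additionally check that the target actually conforms to the shape, which requires a SHACL processor and is out of scope for this sketch.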

A shape is...

Indexes

This document proposes the Indexing ontology as a vocabulary for describing indexes. This ontology uses [[SHACL]] shapes to express what is indexed.

General indexes

For instance, an index of people living in Paris could be expressed as in the example.
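
As a rough illustration, the Python sketch below builds the triples of such an index and prints them in an N-Triples-like form. It is a sketch under stated assumptions, not the normative example: the document IRI, the entry and shape identifiers, the target profile, and the way the shape expresses "living in Paris" (a schema:addressLocality value of "Paris" on a foaf:Person) are all hypothetical simplifications. The SHACL terms are taken from the W3C SHACL namespace.

```python
IDX = "https://ns.inria.fr/idx/terms#"
SH = "http://www.w3.org/ns/shacl#"  # W3C SHACL namespace
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SCHEMA = "http://schema.org/"
FOAF = "http://xmlns.com/foaf/0.1/"
BASE = "https://example.org/indexes/paris"  # hypothetical index document


def iri(value: str) -> str:
    """Wrap an IRI in angle brackets, as in N-Triples."""
    return f"<{value}>"


triples = [
    # The index itself.
    (iri(BASE + "#index"), iri(RDF + "type"), iri(IDX + "Index")),
    # One entry, linked to a shape and to a resource matching that shape.
    (iri(BASE + "#entry1"), iri(RDF + "type"), iri(IDX + "IndexEntry")),
    (iri(BASE + "#entry1"), iri(IDX + "hasShape"), iri(BASE + "#ParisShape")),
    (iri(BASE + "#entry1"), iri(IDX + "hasTarget"),
     iri("https://alice.example/profile#me")),
    # Assumed shape: persons whose locality is "Paris".
    (iri(BASE + "#ParisShape"), iri(RDF + "type"), iri(SH + "NodeShape")),
    (iri(BASE + "#ParisShape"), iri(SH + "targetClass"), iri(FOAF + "Person")),
    (iri(BASE + "#ParisShape"), iri(SH + "property"), "_:city"),
    ("_:city", iri(SH + "path"), iri(SCHEMA + "addressLocality")),
    ("_:city", iri(SH + "hasValue"), '"Paris"'),
]


def to_ntriples(rows) -> str:
    """Serialize (subject, predicate, object) strings, one triple per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in rows)


print(to_ntriples(triples))
```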

Meta-indexes

Meta-indexes are indexes that index other indexes. They can be used to divide an entire index into smaller parts. While more queries are needed to load the data of interest, meta-indexes might reduce the size of the transferred data by targeting parts with precision. They can also give faster results, especially when combined with a heuristic like those described in the following sections.
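
The trade-off described above (more requests, less transferred data) can be sketched in Python. This is only an illustration under assumptions: index documents are modelled as an in-memory dictionary instead of being fetched over HTTP, shapes are compared by plain string equality rather than evaluated with SHACL, and all IRIs and the `find_targets` function are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Each entry is (shape, target, sub_index); exactly one of the last two is set here.
Entry = Tuple[str, Optional[str], Optional[str]]

# Hypothetical index documents: one meta-index pointing to two smaller indexes.
INDEXES: Dict[str, List[Entry]] = {
    "https://example.org/meta": [
        ("ParisShape", None, "https://example.org/paris"),
        ("LyonShape", None, "https://example.org/lyon"),
    ],
    "https://example.org/paris": [
        ("ParisShape", "https://alice.example/profile#me", None),
        ("ParisShape", "https://bob.example/profile#me", None),
    ],
    "https://example.org/lyon": [
        ("LyonShape", "https://carol.example/profile#me", None),
    ],
}


def find_targets(index_iri: str, shape: str, fetched: List[str]) -> List[str]:
    """Collect targets matching `shape`, following sub-indexes recursively."""
    fetched.append(index_iri)  # record every index document that gets loaded
    targets: List[str] = []
    for entry_shape, target, sub_index in INDEXES[index_iri]:
        if entry_shape != shape:
            continue  # skip entries (and whole sub-indexes) for other shapes
        if target is not None:
            targets.append(target)
        if sub_index is not None:
            targets.extend(find_targets(sub_index, shape, fetched))
    return targets


fetched: List[str] = []
results = find_targets("https://example.org/meta", "ParisShape", fetched)
print(results)
print(fetched)
```

In this sketch two index documents are loaded instead of one, but the index about Lyon is never transferred.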

Source selection

Source selection is a technique that consists of selecting, from a set of indexes, those that are judged relevant, which makes it possible to get results faster. This technique relies on one or several heuristics. One example of a heuristic is the number of items that can be found in an index: setting a minimum number of results might reduce the number of selected indexes.

When the data to be found is distributed across multiple indexes, each of them has to be queried to obtain complete results. Without a source selection step before querying, there is no indication of which indexes to query: they are all considered equal. This can be very inefficient, as some indexes without any valid results might be queried.
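
The selection step can be sketched in Python as follows. This is a minimal illustration, assuming each candidate index advertises how many matching items it holds (for instance through an idx:hasCount value); the `select_sources` function and the example IRIs are hypothetical.

```python
from typing import Dict, List


def select_sources(counts: Dict[str, int], min_count: int = 1) -> List[str]:
    """Keep only the indexes whose advertised count reaches `min_count`."""
    return [index_iri for index_iri, count in counts.items() if count >= min_count]


# Hypothetical candidate indexes with their advertised number of items.
candidates = {
    "https://alice.example/index": 0,
    "https://bob.example/index": 4,
    "https://carol.example/index": 25,
}

selected = select_sources(candidates, min_count=1)
print(selected)
```

Raising `min_count` narrows the selection further, at the risk of missing results held by small indexes.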

Source ordering

Source ordering consists of querying the most relevant indexes first. This technique uses one or several criteria to order the indexes. One example of a criterion is the number of results contained in an index. For example, the subindex of :entry2 will be queried before that of :entry1 if it advertises more results (that is, if its idx:hasCount value is higher).

Source ordering can be used to get results faster, as it introduces a priority level between indexes based on some heuristics. Without source ordering, results can be slower to arrive, especially when relevant data is part of an index that is at the end of the querying queue.
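
A minimal Python sketch of ordering by result count is given below. The dictionary layout, the sub-index IRIs and the count values are assumptions made for illustration; only :entry1, :entry2 and idx:hasCount are taken from this document.

```python
# Hypothetical entries, each with the idx:hasCount value of its sub-index.
entries = [
    {"id": ":entry1", "sub_index": "https://example.org/index-a", "count": 3},
    {"id": ":entry2", "sub_index": "https://example.org/index-b", "count": 12},
]

# Query the sub-indexes advertising the most results first.
ordered = sorted(entries, key=lambda entry: entry["count"], reverse=True)

for entry in ordered:
    print(entry["id"], entry["sub_index"])
```

Because Python's sort is stable, entries with equal counts keep their original relative order.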

Sorting the index entries

Sorted index entries MUST use RDF lists.
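
For readers unfamiliar with RDF lists, the Python sketch below shows how an ordered sequence of entries maps to rdf:first / rdf:rest triples terminated by rdf:nil. The `to_rdf_list` helper, the blank node labels and the entry identifiers are hypothetical; how the list is attached to the index is not specified here and is left out.

```python
from typing import List, Tuple

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

Triple = Tuple[str, str, str]


def to_rdf_list(items: List[str]) -> Tuple[str, List[Triple]]:
    """Return the head node of an RDF list and the triples that encode it."""
    if not items:
        return RDF + "nil", []  # the empty list is rdf:nil itself
    nodes = [f"_:b{i}" for i in range(len(items))]  # one blank node per item
    triples: List[Triple] = []
    for i, item in enumerate(items):
        rest = nodes[i + 1] if i + 1 < len(items) else RDF + "nil"
        triples.append((nodes[i], RDF + "first", item))
        triples.append((nodes[i], RDF + "rest", rest))
    return nodes[0], triples


head, list_triples = to_rdf_list([":entry2", ":entry1"])
for triple in list_triples:
    print(triple)
```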

Pagination

Use [[LDP-Paging]]

Indexing strategies

Indexing always starts from a federated point.

Indexed data

When indexing .

TODO: indexing private data.

Distributed indexing

Distributed indexing refers to the situation where indexes are present only on each storage.

Federated indexing

Hybrid indexing

Discovery of indexes

Indexes can be discovered by any means. However, implementers SHOULD use the mechanisms recommended by the client-to-client protocol, if any.

One possible means of discovery is Type Indexes, a well-known mechanism used by applications implementing the [[Solid-protocol]].

Querying indexes

Comunica can be used with streams.

Indexing with Solid

With Solid, data is distributed across storages. Depending on the use case, a client application can search for data in one or several storages. The mechanisms described in this document can be used to index data on Solid storages so that client-side or [=server-side applications=] can search faster. As the Solid server does not need to know about these indexes, application editors should define the indexes they are using in their [=client-to-client protocol=].

The current Solid protocol (version 0.11.0) does not provide a built-in server mechanism to search for data. This is left to [=client-to-client protocols=], which has the advantage of keeping the Solid protocol as simple as possible. The indexing mechanisms listed in this document remain within the scope of client-to-client protocols.

Indexing using the LDP structure

Indexing can be done with LDP when the folder structure has a defined meaning. For instance, messages of the [Solid chat] protocol are stored in a hierarchy of date folders. Clients can therefore rely on this convention to find data faster: for instance, the messages from 2024 can be found in the 2024 folder.
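
The idea can be sketched in Python as follows. This is an illustration only: the chat root IRI is hypothetical, and the year/month/day nesting is an assumption about how the date folders are laid out. The source only states that messages are stored in a hierarchy of date folders.

```python
from datetime import date


def year_container(chat_root: str, year: int) -> str:
    """IRI of the container assumed to hold all messages of a given year."""
    return f"{chat_root.rstrip('/')}/{year:04d}/"


def day_container(chat_root: str, day: date) -> str:
    """IRI of the container assumed to hold the messages of a given day."""
    root = chat_root.rstrip("/")
    return f"{root}/{day.year:04d}/{day.month:02d}/{day.day:02d}/"


root = "https://alice.example/chats/general/"  # hypothetical chat container
print(year_container(root, 2024))
print(day_container(root, date(2024, 3, 7)))
```

A client looking for the messages from 2024 would then only need to explore the first container, instead of scanning the whole chat.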

Relation to Solid Type indexes

Solid Type Indexes are designed to...

Indexing private data?
