This document proposes an RDF vocabulary to describe indexes which are intended to be published on the Web, to facilitate the search of data by Semantic Web agents. Some indexing and index-querying techniques are also introduced.
This document results from a collaboration between INRIA and Startin'Blox on a research project about indexing in a Solid ecosystem.
The Semantic Web vision consists in publishing machine-readable documents on the Web so that machines, like humans, can browse the Web, following links from one document to another, in order to gather information and answer complex questions.
However, when a lot of data is available, gathering information and answering complex questions becomes very inefficient without indexes. Indexes are made of metadata that makes data easier and quicker to find. Indexing is a widely used mechanism in all kinds of software, especially in databases.
This document proposes an RDF vocabulary to describe one kind of such indexes, which are also intended to be published on the Web, to facilitate the search of data by Semantic Web agents. It also presents some indexing and index-querying techniques.
This need for a standard indexing vocabulary appeared in the context of the Solid protocol where several applications interact in the same way with the same data to achieve interoperability. Indeed, as the data is decoupled from the applications, these applications must agree on a client-to-client protocol to work together. If they want to be able to use the same indexes, these must be standardized.
Index: A data structure serving as a summary of a dataset, aiming at quickly locating data in the dataset without having to scan it exhaustively.
Server-side application: TBD.
Client-to-client protocol: TBD.
| Prefix | Namespace | Description |
|---|---|---|
| idx | https://ns.inria.fr/idx/terms# | Indexing ontology |
| sh | http://www.w3.org/ns/shacl# | [[SHACL]] |
| rdf | http://www.w3.org/1999/02/22-rdf-syntax-ns# | [[RDF]] |
| foaf | http://xmlns.com/foaf/0.1/ | |
| schema | http://schema.org/ | |
An index is an RDF document [[RDF11-CONCEPTS]] of type idx:Index. It contains entries of type idx:IndexEntry, each one being linked to a shape with the idx:hasShape predicate.
An entry MUST refer either to a resource matching the shape, with the predicate idx:hasTarget, or to another index referring to such resources, with the predicate idx:hasSubIndex.
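The constraints above can be illustrated with a minimal Python sketch. It is not a normative validator: the index is modelled as plain (subject, predicate, object) tuples rather than parsed RDF, the `https://example.org/` IRIs and the `check_entries` helper are hypothetical, and treating a missing idx:hasShape link as a violation is an assumption of this sketch.

```python
IDX = "https://ns.inria.fr/idx/terms#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "https://example.org/index#"  # hypothetical namespace, for illustration only


def check_entries(triples):
    """Return the idx:IndexEntry nodes that lack a shape or a target/sub-index."""
    entries = {s for s, p, o in triples if p == RDF_TYPE and o == IDX + "IndexEntry"}
    invalid = []
    for entry in sorted(entries):
        predicates = {p for s, p, o in triples if s == entry}
        has_shape = IDX + "hasShape" in predicates
        has_link = bool(predicates & {IDX + "hasTarget", IDX + "hasSubIndex"})
        if not (has_shape and has_link):
            invalid.append(entry)
    return invalid


# A tiny hand-written index: entry1 is complete, entry2 has no target or sub-index.
triples = [
    (EX + "index", RDF_TYPE, IDX + "Index"),
    (EX + "entry1", RDF_TYPE, IDX + "IndexEntry"),
    (EX + "entry1", IDX + "hasShape", EX + "parisShape"),
    (EX + "entry1", IDX + "hasTarget", "https://example.org/people/alice"),
    (EX + "entry2", RDF_TYPE, IDX + "IndexEntry"),
    (EX + "entry2", IDX + "hasShape", EX + "parisShape"),
]

invalid_entries = check_entries(triples)
print(invalid_entries)
```

In this sketch only `entry2` is reported, because it has a shape but neither an idx:hasTarget nor an idx:hasSubIndex link.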
A shape is...
This document proposes the Indexing ontology as a vocabulary for describing indexes. This ontology uses [[SHACL]] shapes to express what is indexed.
For instance, an index of people living in Paris could be expressed as in the example.
Meta-indexes are indexes that index other indexes. They can be used to divide an entire index into smaller parts. While more queries are needed to load the data of interest, meta-indexes might reduce the size of the transferred data by targeting parts with precision. They can also give faster results, especially when combined with a heuristic like the one detailed in .
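The trade-off described above (more requests, less transferred data) can be sketched as follows. This is a simplified illustration, not a reference client: index documents are modelled as Python dictionaries instead of RDF, shapes are compared by IRI equality rather than evaluated with SHACL, and the `DOCUMENTS` store, the `fetch` function and all `https://example.org/` IRIs are hypothetical.

```python
EX = "https://example.org/"  # hypothetical namespace, for illustration only

# Hypothetical in-memory stand-in for documents published on the Web.
DOCUMENTS = {
    EX + "meta": [
        {"hasShape": EX + "parisShape", "hasSubIndex": EX + "paris"},
        {"hasShape": EX + "lyonShape", "hasSubIndex": EX + "lyon"},
    ],
    EX + "paris": [
        {"hasShape": EX + "parisShape", "hasTarget": EX + "alice"},
        {"hasShape": EX + "parisShape", "hasTarget": EX + "bob"},
    ],
    EX + "lyon": [
        {"hasShape": EX + "lyonShape", "hasTarget": EX + "carol"},
    ],
}


def collect_targets(index_url, wanted_shape, fetch, seen=None):
    """Follow only the sub-indexes whose shape matches, and gather their targets."""
    seen = set() if seen is None else seen
    if index_url in seen:  # guard against indexes that reference each other
        return []
    seen.add(index_url)
    targets = []
    for entry in fetch(index_url):
        if entry["hasShape"] != wanted_shape:
            continue  # this part of the index is never downloaded
        if "hasTarget" in entry:
            targets.append(entry["hasTarget"])
        elif "hasSubIndex" in entry:
            targets.extend(
                collect_targets(entry["hasSubIndex"], wanted_shape, fetch, seen)
            )
    return targets


fetched = []


def fetch(url):
    """Record each request so the number of loaded documents can be inspected."""
    fetched.append(url)
    return DOCUMENTS[url]


result = collect_targets(EX + "meta", EX + "parisShape", fetch)
print(result, fetched)
```

Two documents are loaded here (the meta-index, then the Paris sub-index), while the Lyon sub-index is skipped entirely.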
Source selection is a technique that consists in selecting, from a set of indexes, those that are judged relevant, which makes it possible to get results faster. This technique relies on one or several heuristics. One example of a heuristic is the number of items that can be found in an index: setting a minimum number of results might reduce the number of selected indexes.
When the data to be found is distributed across multiple indexes, one has to query each of them to obtain complete results. Without a source selection step before querying, there is no indication of which index to query: they are all considered equal. This can be very inefficient, as some indexes without any valid results might be queried.
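A minimal Python sketch of source selection with a minimum-count heuristic is given below. It assumes that each entry of a meta-index has already been read into a dictionary holding its idx:hasSubIndex value and, when present, its idx:hasCount value; the dictionary keys, the `select_sources` helper and the example IRIs are hypothetical. Keeping entries that advertise no count is a design choice of this sketch, made because such sub-indexes cannot be ruled out.

```python
def select_sources(entries, min_count):
    """Keep the sub-indexes announcing at least `min_count` results."""
    selected = []
    for entry in entries:
        count = entry.get("count")  # None when no idx:hasCount is published
        if count is None or count >= min_count:
            selected.append(entry["subIndex"])
    return selected


# Hypothetical entries read from a meta-index.
entries = [
    {"subIndex": "https://example.org/index-a", "count": 0},
    {"subIndex": "https://example.org/index-b", "count": 12},
    {"subIndex": "https://example.org/index-c", "count": 3},
    {"subIndex": "https://example.org/index-d"},  # count unknown
]

selected = select_sources(entries, min_count=1)
print(selected)
```

With a minimum of one result, `index-a` is discarded before any request is sent to it.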
Source ordering consists in querying the most relevant indexes first. This technique uses one or several criteria to order the indexes. One example of a criterion is the number of results contained in an index.
In the example, the subindex of :entry2 will be queried before that of :entry1, as it presents more results (its idx:hasCount value is higher).
Source ordering can be used to get results faster, as it introduces a priority level between indexes based on some heuristics. Without source ordering, results can be slower to come, especially when the relevant data is part of an index that is at the end of the querying queue.
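Source ordering by result count can be sketched as follows. The count values, the `query` callback, the `FAKE_RESULTS` store and the example IRIs are hypothetical; the only point taken from the text is that the entry with the higher idx:hasCount is queried first. Stopping once enough results are collected is an assumption added to show why the order matters.

```python
def query_in_order(entries, query, limit):
    """Query sub-indexes by decreasing count until `limit` results are found."""
    results = []
    ordered = sorted(entries, key=lambda e: e["count"], reverse=True)
    for entry in ordered:
        results.extend(query(entry["subIndex"]))
        if len(results) >= limit:
            break  # later (less promising) sub-indexes are never requested
    return results[:limit]


# Hypothetical entries: :entry2 announces more results than :entry1.
entries = [
    {"id": ":entry1", "subIndex": "https://example.org/index-a", "count": 2},
    {"id": ":entry2", "subIndex": "https://example.org/index-b", "count": 7},
]

# Hypothetical stand-in for the content of each sub-index.
FAKE_RESULTS = {
    "https://example.org/index-a": ["a1", "a2"],
    "https://example.org/index-b": ["b1", "b2", "b3", "b4", "b5", "b6", "b7"],
}

visited = []


def query(url):
    """Record which sub-indexes are actually requested."""
    visited.append(url)
    return FAKE_RESULTS[url]


first_three = query_in_order(entries, query, limit=3)
print(first_three, visited)
```

Here three results are obtained after a single request to the subindex of `:entry2`, and the subindex of `:entry1` is never queried.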
MUST use RDF lists.
Use [[LDP-Paging]]
Indexing always starts from a federated point.
When indexing .
TODO: indexing private data.
When indexes are only present on each storage, we talk about distributed indexing.
Indexes can be discovered by any means. However, implementers SHOULD use the mechanisms recommended by the client-to-client protocol, if any.
One possible means of discovery is Type Indexes, a well-known mechanism used by applications implementing the [[Solid-protocol]].
Comunica can be used with streams.
With Solid, data is distributed across storages. Depending on the use case, a client application can search for data in one or several storages. The mechanisms described in this document can be used to index data on Solid storages, so that client-side or [=server-side applications=] can search it faster. As the Solid server does not need to know about these indexes, application editors should define the indexes they are using in their [=client-to-client protocol=].
The current Solid protocol (version 0.11.0) does not provide a built-in server mechanism to search for data. This is left to [=client-to-client protocols=], which has the advantage of keeping the Solid protocol as simple as possible. The indexing mechanisms listed in this document stay within the scope of client-to-client protocols.
Indexing can be done with LDP when the folder structure has a defined meaning. For instance, messages of the [Solid chat] protocol are stored in a hierarchy of date folders. This client-side convention can therefore be used to find data faster: for instance, the messages from 2024 can be found in the 2024 folder.
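The folder convention above can be sketched in Python. The year/month/day nesting with zero-padded segments is an assumption of this sketch, since the text only states that messages are stored in a hierarchy of date folders; the `date_container` helper and the `https://pod.example/chat/` root are hypothetical.

```python
def date_container(chat_root, year, month=None, day=None):
    """Build the URL of the container expected to hold messages for a date.

    Assumes a <root>/<yyyy>/<mm>/<dd>/ layout; coarser dates yield parent containers.
    """
    parts = [f"{year:04d}"]
    if month is not None:
        parts.append(f"{month:02d}")
        if day is not None:
            parts.append(f"{day:02d}")
    return chat_root.rstrip("/") + "/" + "/".join(parts) + "/"


year_folder = date_container("https://pod.example/chat/", 2024)
day_folder = date_container("https://pod.example/chat/", 2024, 3, 9)
print(year_folder)
print(day_folder)
```

A client looking for the messages from 2024 can then request the year container directly instead of scanning the whole chat.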
Solid Type Indexes are designed to...
This is required for specifications that contain normative material.