These are the words of a madman, not necessarily true nor possible.
1. Useful SPARQL concepts
1.0.1. Endpoint
An individual service able to reply to SPARQL queries (e.g. tracker-store, or https://query.wikidata.org/)
1.0.2. GRAPH
Individual collections of RDF triples
https://www.w3.org/TR/sparql11-query/#rdfDataset
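For illustration, a query scoped to one named graph might look like this (the graph name is made up):
SELECT ?song { GRAPH <urn:graph:music> { ?song a nmm:MusicPiece } }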
1.0.3. DESCRIBE/CONSTRUCT
Query syntax to generate RDF data out of a dataset
https://www.w3.org/TR/sparql11-query/#describe
https://www.w3.org/TR/sparql11-query/#construct
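Both return RDF triples rather than variable bindings. Minimal sketches:
DESCRIBE ?u WHERE { ?u a nmm:Photo }
CONSTRUCT { ?u nie:title ?t } WHERE { ?u a nmm:Photo ; nie:title ?t }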
1.0.4. LOAD
Update syntax to incorporate external resources (e.g. RDF files) into a graph in the dataset
https://www.w3.org/TR/sparql11-update/#load
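A minimal sketch (the file URI is hypothetical):
LOAD <file:///path/to/data.ttl>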
1.0.5. SERVICE
Syntax to distribute queries across SPARQL endpoints and merge the results
https://www.w3.org/TR/2013/REC-sparql11-federated-query-20130321/#introduction
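A minimal sketch against the wikidata endpoint mentioned above (prefix declarations included so the query is self-contained):
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?cat { SERVICE <https://query.wikidata.org/sparql> { ?cat wdt:P31 wd:Q146 } }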
2. Concepts to explore
2.0.1. DESCRIBE/CONSTRUCT
DESCRIBE/CONSTRUCT at large scale are reasonably easy now that tracker supports unrestricted queries
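For instance, dumping every resource in the store might boil down to something like:
DESCRIBE ?u WHERE { ?u a rdfs:Resource }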
2.0.2. GRAPH
Tracker has very rudimentary support for graphs:
- No two graphs may have the same triple (cardinality is global)
- Unique indices are global too
- FROM/FROM NAMED/GRAPH syntax isn't entirely right
At the heart of all this is the approach used to store graph data in the database: every property has an additional *:graph column, but data from all graphs is actually merged into the same tables, under the same restrictions.
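To illustrate the cardinality issue, consider inserting the same triple into two graphs (resource and graph names are made up):
INSERT DATA { GRAPH <urn:graph:a> { <urn:res> a nie:InformationElement ; nie:title 'Example' } }
INSERT DATA { GRAPH <urn:graph:b> { <urn:res> nie:title 'Example' } }
Since all graphs share the same table rows, both inserts compete for the same row and its single *:graph column; the triple cannot exist independently in both graphs.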
Graphs may generally be considered isolated units, so a more 1:1 approach would consist of storing each graph in an individual database, to be merged together later by the engine (e.g. through https://www.sqlite.org/unionvtab.html). The additional CLEAR/CREATE/DROP/COPY/MOVE/ADD graph management syntax from SPARQL 1.1 might quickly fall into place with this, as sketched below.
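The management operations would map quite directly onto per-graph databases, e.g. (graph names are made up):
CREATE GRAPH <urn:graph:staging>              # a new, empty database
COPY <urn:graph:music> TO <urn:graph:staging> # clone one database into another
DROP GRAPH <urn:graph:staging>                # delete the database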
2.0.2.1. Caveats/pitfalls
Resource IDs do need to be global, and tracker reset must take this into account.
SQLite limitations apply; point 11 in https://sqlite.org/limits.html (maximum number of attached databases) is relevant to this approach.
2.0.3. LOAD
We have most of the pieces to implement LOAD, as we already have a tracker-store DBus method that does pretty much this; basically, it then turns into a language feature. However, it might benefit from graphs as described above.
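For instance, loading external data into a dedicated graph (URIs are hypothetical):
LOAD <file:///path/to/data.ttl> INTO GRAPH <urn:graph:imported>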
2.0.4. SERVICE
SERVICE might be possible to implement through a virtual table (https://sqlite.org/vtab.html). Tracker roughly provides this functionality through tracker_sparql_connection_remote_new(), although that connects to a specific endpoint instead of blending it into the query.
2.0.4.1. Caveats/pitfalls
- Virtual tables have a fixed set of columns, defined at construction time; this might require some JIT/dynamic management of tables in TEMP/MEMORY
- Partially resolving the local query in order to produce the most optimized remote query (e.g. providing values/ranges) seems hard. Just not doing that and letting SQLite handle it all through the virtual table sounds feasible, but slow.
3. Piecing it together
3.0.1. Backups
An application might be able to do:
DESCRIBE ?u WHERE { ?u a nmm:Photo ; nfo:belongsToContainer/nie:url 'file:///run/media...' }
and serialize the results into a file, which might then be loaded through:
LOAD SILENT <file:///...>
This essentially supersedes tracker_sparql_connection_load().
3.0.2. Sandboxing (Option 1)
Built upon graphs as individual databases, which can be selectively exposed into the sandbox filesystem.
3.0.2.1. Pros
- Allows direct readonly access within the sandbox
- Single tracker-store, outside the sandbox
- Minimal SPARQL changes involved
3.0.2.2. Cons
- All updates still have to happen through DBus
- Beware of limits on the number of attached databases
3.0.2.3. ???
- Miners stay in the host
- Data isolation comes from the miners, e.g. music and photos would get distinct graphs, and applications would request access to those.
3.0.3. Sandboxing (Option 1.5)
On top of the previous option, we could make a TrackerSparqlConnection that has a private writable store (like tracker_sparql_connection_local_new()), but can get readonly access to the global store.
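A hypothetical query under this option, joining miner data from the global store with tags kept in the private one (graph names are made up):
SELECT ?song ?tag { GRAPH <urn:miner:music> { ?song a nmm:MusicPiece } GRAPH <urn:app:private> { ?song nao:hasTag ?tag } }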
3.0.3.1. Pros
- Allows direct readonly access within the sandbox
- Updates happen to the local private store, within the sandbox. The host data cannot be changed.
- Minimal SPARQL changes involved
- tracker-extract might move into the sandbox
3.0.3.2. Cons
- Every graph must still follow the same ontology
- If host data is deleted (e.g. tracker reset), the private database cannot be expected to remain coherent.
- Beware of limits on the number of attached databases
3.0.3.3. ???
- Data isolation comes from the miners, e.g. music and photos would get distinct graphs, and applications would request access to those.
3.0.4. Sandboxing (Option 2)
Built upon SERVICE. Tracker clients get a local store, and queries across endpoints are done through SERVICE, e.g.:
SELECT ?a ?url ?d { SERVICE <dbus://org.freedesktop.Tracker.Miner.FS> { ?u a nmm:Photo ; nie:url ?url } . ?a foo:url ?url ; foo:data ?d }
Optionally, clients might export themselves over DBus as a SPARQL endpoint, able to be queried from the outside; e.g. a hypothetical global search might do:
SELECT ?url { { SERVICE <dbus://org.gnome.Music> { ?song nie:url ?url ; fts:match "term" } } UNION { SERVICE <dbus://org.gnome.Photos> { ?photo nie:url ?url ; fts:match "term" } } }
Data becomes fully distributed (SPARQL's vision).
3.0.4.1. Pros
- Full freedom wrt ontologies: the sandboxed application might have custom ontologies and data, meshed together with the Tracker miners' Nepomuk data
- Updates are all kept within the sandbox; remote endpoints being readonly follows naturally from the SPARQL syntax.
3.0.4.2. Cons
- Settles on DBus for IPC with any other endpoint. Direct access is not as straightforward.
- Heavier sparql changes involved
- Although graphs might still be used to split data, access control might be left up to the DBus layer
- Needs some care to avoid breaking out into other endpoints from an authorized one, e.g.:
SELECT * { SERVICE <dbus://org.freedesktop.Tracker.Miner.FS> { SERVICE <dbus://org.gnome.Photos> { } } }
3.0.4.3. ???
- Although tracker-extract data might be within the sandbox, that would effectively lock the client into the Nepomuk ontology.