RDF export & SPARQL queries

SPARQL is a query language used to retrieve and manipulate data stored in Resource Description Framework (RDF) format. In this tutorial, we demonstrate how lamindb registries can be queried with SPARQL.

import warnings
warnings.filterwarnings("ignore")
!lamin load laminlabs/cellxgene
💡 connected lamindb: laminlabs/cellxgene
import bionty as bt

from rdflib import Graph, Literal, RDF, URIRef
💡 connected lamindb: laminlabs/cellxgene

To query data with SPARQL, we first need to build a directed RDF graph composed of triple statements. Each statement consists of:

  1. a node for the subject

  2. an arc for the predicate, directed from the subject to the object

  3. a node for the object.

Each of the three parts can be identified by a URI; the object can alternatively be a literal value such as a string.
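The triple structure described above can be sketched without any library: a graph is simply a collection of (subject, predicate, object) tuples. This is only an illustration of the data model, not how rdflib stores triples internally. The `match` helper is hypothetical, and the two disease entries reuse identifiers that appear later in this tutorial.

```python
# Library-free sketch: a graph as a set of (subject, predicate, object) triples.
EX = "http://sparql-example.org/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

triples = {
    (EX + "MONDO:0006513", RDF_TYPE, EX + "Disease"),
    (EX + "MONDO:0006513", EX + "name", "estrogen-receptor negative breast cancer"),
    (EX + "MONDO:0000618", RDF_TYPE, EX + "Disease"),
}


def match(graph, subject=None, predicate=None, obj=None):
    """Hypothetical helper: return matching triples, with None as a wildcard."""
    return sorted(
        t
        for t in graph
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


# All subjects typed as Disease.
diseases = match(triples, predicate=RDF_TYPE, obj=EX + "Disease")
print(len(diseases))

# The name of one specific disease.
names = match(triples, subject=EX + "MONDO:0006513", predicate=EX + "name")
print(names[0][2])
```

A SPARQL query works on the same principle: each line in its WHERE clause is a triple pattern, with variables in place of the wildcards.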

We can use the DataFrame representation of lamindb registries to build an RDF graph.

Building an RDF graph

diseases = bt.Disease.df()
diseases.head()
id | uid | name | ontology_id | abbr | synonyms | description | public_source_id | run_id | created_by_id | updated_at
689 | Me1FU1fo | breast tumor luminal A or B | MONDO:0004990 | None | breast tumor luminal|luminal breast cancer | Subsets Of Breast Carcinoma Defined By Express... | 49 | None | 1 | 2024-01-15 07:18:58.847956+00:00
688 | rpYSjunF | breast carcinoma by gene expression profile | MONDO:0006116 | None | breast carcinoma by gene expression profile | A Header Term That Includes The Following Brea... | 49 | None | 1 | 2024-01-15 07:18:57.073042+00:00
687 | 1UsnNL28 | Her2-receptor negative breast cancer | MONDO:0000618 | None | None | None | 49 | None | 1 | 2024-01-15 07:18:55.811853+00:00
686 | 1FdMycA0 | estrogen-receptor negative breast cancer | MONDO:0006513 | None | ER- breast cancer | A Subtype Of Breast Cancer That Is Estrogen-Re... | 49 | None | 1 | 2024-01-15 07:18:55.811787+00:00
685 | 2OGAtYpX | progesterone-receptor negative breast cancer | MONDO:0000616 | None | None | None | 49 | None | 1 | 2024-01-15 07:18:55.811715+00:00

We convert the DataFrame to RDF by generating triples.

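rdf_graph = Graph()

# Base URI under which all subjects and predicates of this example live
namespace = URIRef("http://sparql-example.org/")

for _, row in diseases.iterrows():
    # One subject per disease, identified by its ontology ID
    subject = URIRef(namespace + str(row['ontology_id']))
    # Three triples per disease: its type, its name, and its description
    rdf_graph.add((subject, RDF.type, URIRef(namespace + "Disease")))
    rdf_graph.add((subject, URIRef(namespace + "name"), Literal(row['name'])))
    rdf_graph.add((subject, URIRef(namespace + "description"), Literal(row['description'])))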

rdf_graph
<Graph identifier=Nf191ba65116b47d1a63cae01873c2246 (<class 'rdflib.graph.Graph'>)>

Now we can query the RDF graph using SPARQL for the name and associated description:

query = """
SELECT ?name ?description
WHERE {
  ?disease a <http://sparql-example.org/Disease> .
  ?disease <http://sparql-example.org/name> ?name .
  ?disease <http://sparql-example.org/description> ?description .
}
LIMIT 5
"""

for row in rdf_graph.query(query):
    print(f"Name: {row.name}, Description: {row.description}")
Name: breast tumor luminal A or B, Description: Subsets Of Breast Carcinoma Defined By Expression Of Genes Characteristic Of Luminal Epithelial Cells.
Name: breast carcinoma by gene expression profile, Description: A Header Term That Includes The Following Breast Carcinoma Subtypes Determined By Gene Expression Profiling: Luminal A Breast Carcinoma, Luminal B Breast Carcinoma, Her2 Positive Breast Carcinoma, Basal-Like Breast Carcinoma, Triple-Negative Breast Carcinoma, And Normal Breast-Like Subtype Of Breast Carcinoma.
Name: Her2-receptor negative breast cancer, Description: None
Name: estrogen-receptor negative breast cancer, Description: A Subtype Of Breast Cancer That Is Estrogen-Receptor Negative
Name: progesterone-receptor negative breast cancer, Description: None