Train a machine learning model on a collection

Here, we iterate over the artifacts within a collection to train a machine learning model at scale.

import lamindb as ln

💡 connected lamindb: testuser1/test-scrna

ln.settings.transform.stem_uid = "Qr1kIHvK506r"
ln.settings.transform.version = "1"
ln.track()

💡 notebook imports: lamindb==0.72.1 torch==2.3.0

💡 saved: Transform(uid='Qr1kIHvK506r5zKv', version='1', name='Train a machine learning model on a collection', key='scrna5', type='notebook', created_by_id=1, updated_at='2024-05-29 09:59:05 UTC')

💡 saved: Run(uid='8SNMx6l8hiiQlXS55jB2', transform_id=5, created_by_id=1)

Run(uid='8SNMx6l8hiiQlXS55jB2', started_at='2024-05-29 09:59:05 UTC', is_consecutive=True, transform_id=5, created_by_id=1)

Query our collection:

collection = ln.Collection.filter(
    name="My versioned scRNA-seq collection", version="2"
).one()
collection.describe()

Create a map-style dataset

Let us create a map-style dataset using mapped(), which returns a MappedCollection. This is the kind of input that, for example, the PyTorch DataLoader expects.

Under the hood, it performs a virtual join of the features of the underlying AnnData objects, which makes it possible to work with very large collections.

You can either perform a virtual inner join:

with collection.mapped(obs_keys=["cell_type"], join="inner") as dataset:
    print(len(dataset.var_joint))

Or a virtual outer join:

dataset = collection.mapped(obs_keys=["cell_type"], join="outer")

len(dataset.var_joint)
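
To illustrate the difference between the two joins, here is a minimal sketch in plain Python. It does not use lamindb: the gene names and the helper join_vars are hypothetical and only stand in for the variable names of the underlying AnnData objects.

```python
# Hypothetical feature (gene) names for three AnnData-like objects.
var_names_per_artifact = [
    ["CD3E", "CD4", "CD8A", "MS4A1"],
    ["CD4", "CD8A", "MS4A1", "NKG7"],
    ["CD8A", "MS4A1", "NKG7", "GNLY"],
]


def join_vars(var_lists, how="inner"):
    """Return the joint feature list, keeping first-seen order.

    Illustrative helper, not part of lamindb.
    """
    if how not in ("inner", "outer"):
        raise ValueError(f"unknown join type: {how!r}")
    if not var_lists:
        return []
    # Collect every feature once, in the order it first appears.
    seen_order = []
    for names in var_lists:
        for name in names:
            if name not in seen_order:
                seen_order.append(name)
    if how == "outer":
        # Outer join: the union of all features.
        return seen_order
    # Inner join: only features present in every object.
    shared = set(var_lists[0]).intersection(*var_lists[1:])
    return [name for name in seen_order if name in shared]


inner = join_vars(var_names_per_artifact, how="inner")
outer = join_vars(var_names_per_artifact, how="outer")
print(len(inner), len(outer))
```

The inner join keeps only the features shared by all objects, so it can never be longer than the outer join, which keeps every feature seen in any object.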

This is compatible with a PyTorch DataLoader because it implements __getitem__ over a list of backed AnnData objects. Because indexing is zero-based, the cell at index 5 (the sixth cell in the collection) can be accessed like this:

dataset[5]
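
As a rough illustration of what "map-style" means, the sketch below implements __len__ and __getitem__ over several row containers without concatenating them. The class ConcatenatedRows and its data are hypothetical, and this is one common way to map a global index to a container. It is not a description of how MappedCollection is implemented.

```python
from bisect import bisect_right
from itertools import accumulate


class ConcatenatedRows:
    """Map-style view over several row containers without copying them."""

    def __init__(self, blocks):
        self.blocks = blocks
        # Cumulative row counts, e.g. sizes [3, 2, 4] -> [3, 5, 9].
        self.offsets = list(accumulate(len(block) for block in blocks))

    def __len__(self):
        return self.offsets[-1] if self.offsets else 0

    def __getitem__(self, index):
        if index < 0:
            index += len(self)
        if not 0 <= index < len(self):
            raise IndexError(index)
        # Find the first block whose cumulative count exceeds the index.
        block_idx = bisect_right(self.offsets, index)
        start = self.offsets[block_idx - 1] if block_idx > 0 else 0
        return self.blocks[block_idx][index - start]


rows = ConcatenatedRows(
    [["a0", "a1", "a2"], ["b0", "b1"], ["c0", "c1", "c2", "c3"]]
)
print(len(rows), rows[5])
```

Index 5 resolves to the first row of the third container, which is the sixth row overall.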

The labels are encoded into integers:

dataset.encoders
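
The idea behind such an encoder can be sketched as a plain dictionary from category to integer code. The labels and the helper build_encoder below are hypothetical, and sorting the categories is an assumption made only to keep the example reproducible; the order that lamindb assigns may differ.

```python
def build_encoder(labels):
    """Map each distinct label to an integer code (illustrative helper)."""
    categories = sorted(set(labels))
    return {category: code for code, category in enumerate(categories)}


# Hypothetical cell type labels, one per cell.
cell_types = ["T cell", "B cell", "T cell", "NK cell", "B cell", "T cell"]

encoder = build_encoder(cell_types)
encoded = [encoder[label] for label in cell_types]
print(encoder)
print(encoded)
```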

Create a PyTorch DataLoader

Let us use a weighted sampler:

from torch.utils.data import DataLoader, WeightedRandomSampler

# the key used for the weights doesn't have to be among the obs_keys passed to mapped()
sampler = WeightedRandomSampler(
    weights=dataset.get_label_weights("cell_type"), num_samples=len(dataset)
)
dataloader = DataLoader(dataset, batch_size=128, sampler=sampler)
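
To see why a weighted sampler helps with imbalanced labels, here is a small stdlib sketch. It assumes that each cell is weighted by the inverse frequency of its label; the helper inverse_frequency_weights and the toy labels are hypothetical, and random.choices stands in for the sampler by drawing indices with replacement.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Give each item the weight 1 / (number of items sharing its label)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


# Hypothetical imbalanced labels: 8 T cells and 2 B cells.
labels = ["T cell"] * 8 + ["B cell"] * 2
weights = inverse_frequency_weights(labels)

# Draw indices with replacement, proportionally to the weights.
rng = random.Random(0)
sampled = rng.choices(range(len(labels)), weights=weights, k=10_000)
sampled_counts = Counter(labels[i] for i in sampled)
print(sampled_counts)
```

With these weights, the total weight of each label is the same, so both cell types are drawn about equally often even though T cells are four times more frequent in the data.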

We can now iterate through the data loader:

for batch in dataloader:
    pass

Close the connections in MappedCollection:

dataset.close()