nucleus.dataset

Dataset

Datasets are collections of your data that can be associated with models.

ObjectQueryType


class nucleus.dataset.Dataset(dataset_id, client, name=None, is_scene=None, use_privacy_mode=None)

Datasets are collections of your data that can be associated with models.

You can append DatasetItems or Scenes with metadata to your dataset, annotate it with ground truth, and upload model predictions to evaluate and compare model performance on your data.

Make sure that the dataset is set up correctly to support the required datatype (see the code sample below).

Datasets cannot be instantiated directly and instead must be created via API endpoint using NucleusClient.create_dataset(), or in the dashboard.

import nucleus

client = nucleus.NucleusClient(YOUR_SCALE_API_KEY)

# Create new dataset supporting DatasetItems
dataset = client.create_dataset(YOUR_DATASET_NAME, is_scene=False)

# OR create new dataset supporting LidarScenes
dataset = client.create_dataset(YOUR_DATASET_NAME, is_scene=True)

# Or, retrieve existing dataset by ID
# This ID can be fetched using client.list_datasets() or from a dashboard URL
existing_dataset = client.get_dataset("YOUR_DATASET_ID")
Parameters:

client (nucleus.NucleusClient)

add_items_from_dir(dirname=None, existing_dirname=None, privacy_mode_proxy='', allowed_file_types=('png', 'jpg', 'jpeg'), skip_size_warning=False, update_items=False)

Update the dataset by recursively crawling through a directory. A DatasetItem will be created for each unique image found. Existing items are skipped or updated depending on the update_items parameter.

Parameters:
  • dirname (Optional[str]) – Where to look for image files, recursively

  • existing_dirname (Optional[str]) – Already validated dirname

  • privacy_mode_proxy (str) – Endpoint that serves image files for privacy mode, ignore if not using privacy mode. The proxy should work based on the relative path of the images in the directory.

  • allowed_file_types (Tuple[str, ...]) – Which file type extensions to search for, e.g. ('jpg', 'png')

  • skip_size_warning (bool) – If False, it will throw an error if the script globs more than 500 images. This is a safety check in case the dirname has a typo, and grabs too much data.

  • update_items (bool) – Whether to update items in existing dataset

add_taxonomy(taxonomy_name, taxonomy_type, labels, update=False)

Creates a new taxonomy.

At the moment we only support taxonomies for category annotations and predictions.

import nucleus
client = nucleus.NucleusClient("YOUR_SCALE_API_KEY")
dataset = client.get_dataset("ds_bwkezj6g5c4g05gqp1eg")

response = dataset.add_taxonomy(
    taxonomy_name="clothing_type",
    taxonomy_type="category",
    labels=["shirt", "trousers", "dress"],
    update=False
)
Parameters:
  • taxonomy_name (str) – The name of the taxonomy. Taxonomy names must be unique within a dataset.

  • taxonomy_type (str) – The type of this taxonomy as a string literal. Currently, the only supported taxonomy type is “category.”

  • labels (List[str]) – The list of possible labels for the taxonomy.

  • update (bool) – Whether or not to update taxonomy labels on taxonomy name collision. Default is False. Note that taxonomy labels will not be deleted on update, they can only be appended.

Returns:

Returns a response with dataset_id, taxonomy_name, and status of the add taxonomy operation.

{
    "dataset_id": str,
    "taxonomy_name": str,
    "status": "Taxonomy created"
}

annotate(annotations, update=DEFAULT_ANNOTATION_UPDATE_MODE, batch_size=5000, asynchronous=False, remote_files_per_upload_request=20, local_files_per_upload_request=10)

Uploads ground truth annotations to the dataset.

Adding ground truth to your dataset in Nucleus allows you to visualize annotations, query dataset items based on the annotations they contain, and evaluate models by comparing their predictions to ground truth.

Nucleus supports Box, Polygon, Cuboid, Segmentation, and Category annotations. Cuboid annotations can only be uploaded to a pointcloud DatasetItem.

When uploading an annotation, you need to specify which item you are annotating via the reference_id you provided when uploading the image or pointcloud.

Ground truth uploads can be made idempotent by specifying an optional annotation_id for each annotation. This id should be unique within the dataset_item so that (reference_id, annotation_id) is unique within the dataset.

See SegmentationAnnotation for specific requirements to upload segmentation annotations.

For ingesting large annotation payloads, see the Guide for Large Ingestions.

Parameters:
  • annotations (Sequence[Annotation]) – List of annotation objects to upload.

  • update (bool) – Whether to ignore or overwrite metadata for conflicting annotations.

  • batch_size (int) – Number of annotations processed in each concurrent batch. Default is 5000. If you get timeouts when uploading geometric annotations, you can try lowering this batch size.

  • asynchronous (bool) – Whether or not to process the upload asynchronously (and return an AsyncJob object). Default is False.

  • remote_files_per_upload_request (int) – Number of remote files to upload in each request. Segmentations have either local or remote files, if you are getting timeouts while uploading segmentations with remote urls, you should lower this value from its default of 20.

  • local_files_per_upload_request (int) – Number of local files to upload in each request. Segmentations have either local or remote files, if you are getting timeouts while uploading segmentations with local files, you should lower this value from its default of 10. The maximum is 10.

Returns:

If synchronous, payload describing the upload result:

{
    "dataset_id": str,
    "annotations_processed": int
}

Otherwise, returns an AsyncJob object.

Return type:

Union[Dict[str, Any], nucleus.async_job.AsyncJob]
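The idempotency and batching behavior described above can be sketched in plain Python. batch_annotations is a hypothetical helper for illustration, not an SDK function, and it operates on plain dicts rather than Annotation objects:

```python
def batch_annotations(annotations, batch_size=5000):
    # Deduplicate on (reference_id, annotation_id): re-uploads of the same
    # pair are idempotent, mirroring the upload semantics described above.
    seen, unique = set(), []
    for ann in annotations:
        key = (ann["reference_id"], ann.get("annotation_id"))
        if key not in seen:
            seen.add(key)
            unique.append(ann)
    # Split into concurrent upload batches of at most batch_size annotations.
    return [unique[i:i + batch_size] for i in range(0, len(unique), batch_size)]
```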

append(items, update=False, batch_size=20, asynchronous=False, local_files_per_upload_request=10)

Appends items or scenes to a dataset.

Note

Datasets can only accept one of DatasetItems or Scenes, never both.

This behavior is set during Dataset creation with the is_scene flag.

import nucleus

client = nucleus.NucleusClient("YOUR_SCALE_API_KEY")
dataset = client.get_dataset("YOUR_DATASET_ID")

local_item = nucleus.DatasetItem(
  image_location="./1.jpg",
  reference_id="image_1",
  metadata={"key": "value"}
)
remote_item = nucleus.DatasetItem(
  image_location="s3://your-bucket/2.jpg",
  reference_id="image_2",
  metadata={"key": "value"}
)

# default is synchronous upload
sync_response = dataset.append(items=[local_item])

# async jobs have higher throughput but can be more difficult to debug
async_job = dataset.append(
  items=[remote_item], # all items must be remote for async
  asynchronous=True
)
print(async_job.status())

A Dataset can be populated with labeled and unlabeled data. Using Nucleus, you can filter down the data inside your dataset using custom metadata about your images.

For instance, your local dataset may contain Sunny, Foggy, and Rainy folders of images. All of these images can be uploaded into a single Nucleus Dataset, with (queryable) metadata like {"weather": "Sunny"}.

To update an item’s metadata, you can re-ingest the same items with the update argument set to true. Existing metadata will be overwritten for DatasetItems in the payload that share a reference_id with a previously uploaded DatasetItem. To retrieve your existing reference_ids, use Dataset.items().

# overwrite metadata by reuploading the item
remote_item.metadata["weather"] = "Sunny"

async_job_2 = dataset.append(
  items=[remote_item],
  update=True,
  asynchronous=True
)
Parameters:
  • items (Union[Sequence[DatasetItem], Sequence[LidarScene], Sequence[VideoScene]]) – List of items or scenes to upload.

  • batch_size (int) – Size of the batch for larger uploads. Default is 20. This is for items that have a remote URL and do not require a local upload. If you get timeouts for uploading remote urls, try decreasing this.

  • update (bool) – Whether or not to overwrite metadata on reference ID collision. Default is False.

  • asynchronous (bool) – Whether or not to process the upload asynchronously (and return an AsyncJob object). This is required when uploading scenes. Default is False.

  • files_per_upload_request – Optional; default is 10. We recommend lowering this if you encounter timeouts.

  • local_files_per_upload_request (int) – Optional; default is 10. We recommend lowering this if you encounter timeouts.

Returns:

For scenes

If synchronous, returns a payload describing the upload result:

{
    "dataset_id": str,
    "new_items": int,
    "updated_items": int,
    "ignored_items": int,
    "upload_errors": int
}

Otherwise, returns an AsyncJob object.

For images

If synchronous, returns nucleus.upload_response.UploadResponse; otherwise, returns an AsyncJob.

Return type:

Union[Dict[Any, Any], nucleus.async_job.AsyncJob, nucleus.upload_response.UploadResponse]

autotag_items(autotag_name, for_scores_greater_than=0)

Fetches the autotag’s items above the score threshold, sorted by descending score.

Parameters:
  • autotag_name – The user-defined name of the autotag.

  • for_scores_greater_than (Optional[int]) – Score threshold between -1 and 1 above which to include autotag items.

Returns:

List of autotagged items above the given score threshold, sorted by descending score, and autotag info, packaged into a dict as follows:

{
    "autotagItems": List[{
        ref_id: str,
        score: float,
        model_prediction_annotation_id: str | None
        ground_truth_annotation_id: str | None,
    }],
    "autotag": {
        id: str,
        name: str,
        status: "started" | "completed",
        autotag_level: "Image" | "Object"
    }
}

Note model_prediction_annotation_id and ground_truth_annotation_id are only relevant for object autotags.
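The threshold-and-sort behavior above can be illustrated in plain Python. filter_autotag_items is a hypothetical helper operating on the autotagItems payload entries, not an SDK function:

```python
def filter_autotag_items(autotag_items, for_scores_greater_than=0.0):
    # Keep items strictly above the score threshold, sorted by descending
    # score, matching the ordering of the response described above.
    kept = [item for item in autotag_items
            if item["score"] > for_scores_greater_than]
    return sorted(kept, key=lambda item: item["score"], reverse=True)
```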

autotag_training_items(autotag_name)

Fetches items that were manually selected during refinement of the autotag.

Parameters:

autotag_name – The user-defined name of the autotag.

Returns:

List of user-selected positives and autotag info, packaged into a dict as follows:

{
    "autotagPositiveTrainingItems": List[{
        ref_id: str,
        model_prediction_annotation_id: str | None,
        ground_truth_annotation_id: str | None,
    }],
    "autotag": {
        id: str,
        name: str,
        status: "started" | "completed",
        autotag_level: "Image" | "Object"
    }
}

Note model_prediction_annotation_id and ground_truth_annotation_id are only relevant for object autotags.

build_slice(name, sample_size, sample_method, filters=None)

Build a slice using Nucleus' Smart Sample tool, which allows slices to be built based on a sampling method and optional filters.

Parameters:
  • name (str) – Name for the slice being created. Must be unique per dataset.

  • sample_size (int) – Size of the slice to create. Capped by the size of the dataset and the applied filters.

  • sample_method (Union[str, nucleus.slice.SliceBuilderMethods]) – How to sample the dataset; currently supports 'Random' and 'Uniqueness'

  • filters (Optional[nucleus.slice.SliceBuilderFilters]) – Apply filters to only sample from an existing slice or autotag

Return type:

Union[str, Tuple[nucleus.async_job.AsyncJob, str], dict]

Examples

from nucleus.slice import SliceBuilderFilters, SliceBuilderMethods, SliceBuilderFilterAutotag

# random slice
job = dataset.build_slice("RandomSlice", 20, SliceBuilderMethods.RANDOM)

# slice with filters
filters = SliceBuilderFilters(
    slice_id="<some slice id>",
    autotag=SliceBuilderFilterAutotag("tag_cd41jhjdqyti07h8m1n1", [-0.5, 0.5])
)
job = dataset.build_slice("NewSlice", 20, SliceBuilderMethods.RANDOM, filters)

Returns: An async job

calculate_evaluation_metrics(model, options=None)

Starts computation of evaluation metrics for a model on the dataset.

To update matches and metrics calculated for a model on a given dataset you can call this endpoint. This is required in order to sort by IOU, view false positives/false negatives, and view model insights.

You can add predictions from a model to a dataset after running the calculation of the metrics. However, the calculation of metrics will have to be retriggered for the new predictions to be matched with ground truth and appear as false positives/negatives, or for the new predictions' effect on metrics to be reflected in model run insights.

During IoU calculation, bounding box Predictions are compared to GroundTruth using a greedy matching algorithm that matches prediction and ground truth boxes with the highest IoUs first. By default the matching algorithm is class-agnostic: it will greedily create matches regardless of the class labels.

The algorithm can be tuned to classify true positives between certain classes, but not others. This is useful if the labels in your ground truth do not match the exact strings of your model predictions, or if you want to associate multiple predictions with one ground truth label, or multiple ground truth labels with one prediction. To recompute metrics based on different matching, you can re-commit the run with new request parameters.

import nucleus

client = nucleus.NucleusClient("YOUR_SCALE_API_KEY")
dataset = client.get_dataset(dataset_id="YOUR_DATASET_ID")

model = client.get_model(model_id="YOUR_MODEL_PRJ_ID")

# Compute all evaluation metrics including IOU-based matching:
dataset.calculate_evaluation_metrics(model)

# Match car and bus bounding boxes (for IOU computation)
# Otherwise enforce that class labels must match
dataset.calculate_evaluation_metrics(model, options={
  'allowed_label_matches': [
    {
      'ground_truth_label': 'car',
      'model_prediction_label': 'bus'
    },
    {
      'ground_truth_label': 'bus',
      'model_prediction_label': 'car'
    }
  ]
})
Parameters:
  • model (Model) – The model object for which to calculate metrics.

  • options (Optional[dict]) –

    Dictionary of specific options to configure metrics calculation.

    class_agnostic

    Whether ground truth and prediction classes can differ when being matched for evaluation metrics. Default is True.

    allowed_label_matches

    Pairs of ground truth and prediction classes that should be considered matchable when computing metrics. If supplied, class_agnostic must be False.

    {
        "class_agnostic": bool,
        "allowed_label_matches": List[{
            "ground_truth_label": str,
            "model_prediction_label": str
        }]
    }
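The constraint between the two options can be sketched as a small validation step. validate_metrics_options is a hypothetical helper for illustration, not part of the SDK:

```python
def validate_metrics_options(options):
    # allowed_label_matches is only honored when class_agnostic is False,
    # per the constraint described above.
    if "allowed_label_matches" in options and options.get("class_agnostic", True):
        raise ValueError("class_agnostic must be False when allowed_label_matches is set")
    return options
```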
    

create_custom_index(embeddings_urls, embedding_dim)

Processes user-provided embeddings for the dataset to use with autotag and simsearch.

import nucleus

client = nucleus.NucleusClient("YOUR_SCALE_API_KEY")
dataset = client.get_dataset("YOUR_DATASET_ID")

all_embeddings = {
    "reference_id_0": [0.1, 0.2, 0.3],
    "reference_id_1": [0.4, 0.5, 0.6],
    ...
    "reference_id_10000": [0.7, 0.8, 0.9]
} # sharded and uploaded to s3 with the two below URLs

embeddings_url_1 = "s3://dataset/embeddings_map_1.json"
embeddings_url_2 = "s3://dataset/embeddings_map_2.json"

response = dataset.create_custom_index(
    embeddings_urls=[embeddings_url_1, embeddings_url_2],
    embedding_dim=3
)
Parameters:
  • embeddings_urls (List[str]) – List of URLs, each pointing to a JSON mapping reference_id -> embedding vector. Each embedding JSON must contain fewer than 5000 rows.

  • embedding_dim (int) – The dimension of the embedding vectors. Must be consistent across all embedding vectors in the index.

Returns:

Asynchronous job object to track processing status.

Return type:

AsyncJob
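The sharding step mentioned in the comment above (splitting the embeddings map into JSON files of fewer than 5000 rows each) can be sketched in plain Python. shard_embeddings is a hypothetical helper; uploading the resulting shards to s3 is left out:

```python
import json

def shard_embeddings(embeddings, rows_per_shard=4999):
    # Split a reference_id -> vector mapping into JSON shards of fewer
    # than 5000 rows each, the per-file limit noted above.
    items = list(embeddings.items())
    return [
        json.dumps(dict(items[i:i + rows_per_shard]))
        for i in range(0, len(items), rows_per_shard)
    ]
```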

create_image_index()

Creates or updates image index by generating embeddings for images that do not already have embeddings.

The embeddings are used for autotag and similarity search.

This endpoint is limited to index up to 2 million images at a time and the job will fail for payloads that exceed this limit.

Returns:

Asynchronous job object to track processing status.

Return type:

AsyncJob

create_object_index(model_run_id=None, gt_only=None)

Creates or updates object index by generating embeddings for objects that do not already have embeddings.

These embeddings are used for autotag and similarity search. This endpoint only supports indexing objects sourced from the predictions of a specific model or the ground truth annotations of the dataset.

This endpoint is idempotent. If this endpoint is called again for a model whose predictions were indexed in the past, the previously indexed predictions will not have new embeddings recomputed. The same is true for ground truth annotations.

Note that this means if you update a prediction or ground truth bounding box that already has an associated embedding, the embedding will not be updated, even with another call to this endpoint. For now, we recommend deleting the prediction or ground truth annotation and re-inserting it to force generation of a new embedding.

This endpoint is limited to generating embeddings for 3 million objects at a time and the job will fail for payloads that exceed this limit.

Parameters:
  • model_run_id (Optional[str]) –

    The ID of the model whose predictions should be indexed. Default is None, but must be supplied in the absence of gt_only.

  • gt_only (Optional[bool]) – Whether to only generate embeddings for the ground truth annotations of the dataset. Default is None, but must be supplied in the absence of model_run_id.

Returns:

Asynchronous job object to track processing status.

Return type:

AsyncJob
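The requirement that exactly one of model_run_id or gt_only be supplied can be expressed as a small check. resolve_index_source is a hypothetical helper for illustration, not an SDK function:

```python
def resolve_index_source(model_run_id=None, gt_only=None):
    # Exactly one source must be chosen: a model's predictions or the
    # dataset's ground truth annotations, per the constraint above.
    if bool(model_run_id) == bool(gt_only):
        raise ValueError("Supply exactly one of model_run_id or gt_only=True")
    return {"model_run_id": model_run_id} if model_run_id else {"gt_only": True}
```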

create_slice(name, reference_ids=None)

Creates a Slice of dataset items within a dataset.

Parameters:
  • name (str) – A human-readable name for the slice.

  • reference_ids (Optional[List[str]]) – List of reference IDs of dataset items to add to the slice, cannot exceed 10,000 items. Can be left unspecified, and an empty slice will be created.

Returns:

The newly constructed slice item.

Return type:

Slice

Raises:

BadRequest – If length of reference_ids is too large (> 10,000 items)

create_slice_by_ids(name, dataset_item_ids=None, scene_ids=None, annotation_ids=None, prediction_ids=None)

Creates a Slice of dataset items, scenes, annotations, or predictions within a dataset by their IDs.

Note

Dataset item, scene, and object (annotation or prediction) IDs may not be mixed. However, when creating an object slice, both annotation and prediction IDs may be supplied.

Parameters:
  • name (str) – A human-readable name for the slice.

  • dataset_item_ids (Optional[List[str]]) – List of internal IDs of dataset items to add to the slice:

  • scene_ids (Optional[List[str]]) – List of internal IDs of scenes to add to the slice:

  • annotation_ids (Optional[List[str]]) – List of internal IDs of Annotations to add to the slice:

  • prediction_ids (Optional[List[str]]) – List of internal IDs of Predictions to add to the slice:

Returns:

The newly constructed slice item.

Return type:

Slice

delete_annotations(reference_ids=None, keep_history=True)

Deletes all annotations associated with the specified item reference IDs.

Parameters:
  • reference_ids (Optional[list]) – List of user-defined reference IDs of the dataset items from which to delete annotations. Defaults to an empty list.

  • keep_history (bool) – Whether to preserve version history. We recommend skipping this parameter and using the default value of True.

Returns:

Empty payload response.

Return type:

AsyncJob

delete_custom_index(image=True)

Deletes the custom index uploaded to the dataset.

Returns:

Payload containing information that can be used to track the job’s status:

{
    "dataset_id": str,
    "job_id": str,
    "message": str
}

Parameters:

image (bool)

delete_item(reference_id)

Deletes an item from the dataset by item reference ID.

All annotations and predictions associated with the item will be deleted as well.

Parameters:

reference_id (str) – The user-defined reference ID of the item to delete.

Returns:

Payload to indicate deletion invocation.

Return type:

dict

delete_scene(reference_id)

Deletes a scene from the Dataset by scene reference ID.

All items, annotations, and predictions associated with the scene will be deleted as well.

Parameters:

reference_id (str) – The user-defined reference ID of the scene to delete.

delete_taxonomy(taxonomy_name)

Deletes the given taxonomy.

All annotations and predictions associated with the taxonomy will be deleted as well.

Parameters:

taxonomy_name (str) – The name of the taxonomy.

Returns:

Returns a response with dataset_id, taxonomy_name, and status of the delete taxonomy operation.

{
    "dataset_id": str,
    "taxonomy_name": str,
    "status": "Taxonomy successfully deleted"
}

delete_tracks(track_reference_ids)

Deletes a list of tracks from the dataset, thereby unlinking their annotation and prediction instances.

Parameters:
track_reference_ids (List[str]) – A list of reference IDs for tracks to delete.

Return type:

None

export_embeddings(asynchronous=True)

Fetches a pd.DataFrame-ready list of dataset embeddings.

Parameters:

asynchronous (bool) – Whether or not to process the export asynchronously (and return an EmbeddingsExportJob object). Default is True.

Returns:

If synchronous, a list where each item is a dict with two keys representing a row in the dataset:

List[{
    "reference_id": str,
    "embedding_vector": List[float]
}]

Otherwise, returns an EmbeddingsExportJob object.

Return type:

Union[List[Dict[str, Union[str, List[float]]]], nucleus.async_job.EmbeddingsExportJob]
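The synchronous export payload above can be reshaped into columns ready for a pd.DataFrame. embeddings_frame is a hypothetical helper operating on the exported row dicts, not part of the SDK:

```python
def embeddings_frame(rows):
    # Reshape the export payload into column lists, and sanity-check that
    # all embedding vectors share one dimension.
    dims = {len(r["embedding_vector"]) for r in rows}
    if len(dims) > 1:
        raise ValueError(f"Inconsistent embedding dimensions: {sorted(dims)}")
    return {
        "reference_id": [r["reference_id"] for r in rows],
        "embedding_vector": [r["embedding_vector"] for r in rows],
    }
```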

export_predictions(model)

Fetches all predictions of a model that were uploaded to the dataset.

Parameters:

model (Model) – The model whose predictions to retrieve.

Returns:

List of prediction objects from the model.

Return type:

List[Union[BoxPrediction, PolygonPrediction, CuboidPrediction, SegmentationPrediction, CategoryPrediction, KeypointsPrediction]]

export_scale_task_info()

Fetches info for all linked Scale tasks of items/scenes in the dataset.

Returns:

A list of dicts, each with two keys, respectively mapping to items/scenes and info on their corresponding Scale tasks within the dataset:

List[{
    "item" | "scene": Union[DatasetItem, Scene],
    "scale_task_info": {
        "task_id": str,
        "task_status": str,
        "task_audit_status": str,
        "task_audit_review_comment": Optional[str],
        "project_name": str,
        "batch": str,
        "created_at": str,
        "completed_at": Optional[str]
    }[]
}]

get_image_indexing_status()

Gets the primary image index progress for the dataset.

Returns:

Response payload:

{
    "embedding_count": int,
    "image_count": int,
    "percent_indexed": float,
    "additional_context": str
}

get_object_indexing_status(model_run_id=None)

Gets the primary object index progress of the dataset. If model_run_id is not specified, this endpoint will retrieve the indexing progress of the ground truth objects.

Returns:

Response payload:

{
    "embedding_count": int,
    "object_count": int,
    "percent_indexed": float,
    "additional_context": str
}

get_scene(reference_id)

Fetches a single scene in the dataset by its reference ID.

Parameters:

reference_id (str) – The user-defined reference ID of the scene to fetch.

Returns:

A scene object containing frames, which in turn contain pointcloud or image items.

Return type:

Scene

get_scene_from_item_ref_id(item_reference_id)

Given a dataset item reference ID, find the Scene it belongs to.

Parameters:

item_reference_id (str)

Return type:

Optional[nucleus.scene.Scene]

get_slices(name=None, slice_type=None)

Get a list of slices by name or underlying slice type.

Parameters:
  • name (Optional[str]) – Name of the desired slice to look up.

  • slice_type (Optional[Union[str, nucleus.slice.SliceType]]) – Type of slice to look up. This can be one of ('dataset_item', 'object', 'scene')

Raises:

NotFound – If no slice(s) were found with the given criteria.

Returns:

The Nucleus slice as an object.

Return type:

Slice

ground_truth_loc(reference_id, annotation_id)

Fetches a single ground truth annotation by ID.

Parameters:
  • reference_id (str) – User-defined reference ID of the dataset item associated with the ground truth annotation.

  • annotation_id (str) – User-defined ID of the ground truth annotation.

Returns:

Ground truth annotation object with the specified annotation ID.

Return type:

Union[BoxAnnotation, LineAnnotation, PolygonAnnotation, KeypointsAnnotation, CuboidAnnotation, SegmentationAnnotation, CategoryAnnotation]

iloc(i)

Fetches dataset item and associated annotations by absolute numerical index.

Parameters:

i (int) – Absolute numerical index of the dataset item within the dataset.

Returns:

Payload describing the dataset item and associated annotations:

{
    "item": DatasetItem
    "annotations": {
        "box": Optional[List[BoxAnnotation]],
        "cuboid": Optional[List[CuboidAnnotation]],
        "line": Optional[List[LineAnnotation]],
        "polygon": Optional[List[PolygonAnnotation]],
        "keypoints": Optional[List[KeypointsAnnotation]],
        "segmentation": Optional[List[SegmentationAnnotation]],
        "category": Optional[List[CategoryAnnotation]],
    }
}

Return type:

dict

info()

Fetches information about the dataset.

Returns:

Information about the dataset including its Scale-generated ID, name, length, associated Models, Slices, and more.

Return type:

DatasetInfo

ingest_tasks(task_ids)

Ingest specific tasks from an existing Scale or Rapid project into the dataset.

Note: if you would like to create a new Dataset from an existing Scale labeling project, use NucleusClient.create_dataset_from_project().

For more info, see our Ingest From Labeling Guide.

Parameters:

task_ids (List[str]) – List of task IDs to ingest.

Returns:

Payload describing the asynchronous upload result:

{
    "ingested_tasks": int,
    "ignored_tasks": int,
    "pending_tasks": int
}

Return type:

dict

items_and_annotation_chip_generator(chip_size, stride_size, cache_directory, query=None, num_processes=0)

Provides a generator of chips for all DatasetItems and BoxAnnotations in the dataset.

A chip is an image created by tiling a source image.

Parameters:
  • chip_size (int) – The size of the image chip

  • stride_size (int) – The distance to move when creating the next image chip. When stride is equal to chip size, there will be no overlap. When stride is equal to half the chip size, there will be 50 percent overlap.

  • cache_directory (str) – The s3 or local directory to store the image and annotations of a chip. s3 directories must be in the format s3://s3-bucket/s3-key

  • query (Optional[str]) – Structured query compatible with the Nucleus query language.

  • num_processes (int) – The number of worker processes to use to chip and upload images. If unset, no parallel processing will occur.

Returns:

Generator where each element is a dict containing the location of the image chip (jpeg) and its annotations (json).

Iterable[{
    "image_location": str,
    "annotation_location": str
}]

Return type:

Iterable[Dict[str, str]]
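The stride semantics above (stride equal to chip size means no overlap; stride equal to half the chip size means 50 percent overlap) can be sketched by computing chip origins. chip_origins is a hypothetical helper for illustration; it ignores the partial tiles a real chipper may emit at image edges:

```python
def chip_origins(width, height, chip_size, stride_size):
    # Top-left corner of each chip tile: stride == chip_size -> no overlap;
    # stride == chip_size // 2 -> 50 percent overlap between adjacent chips.
    xs = list(range(0, max(width - chip_size, 0) + 1, stride_size))
    ys = list(range(0, max(height - chip_size, 0) + 1, stride_size))
    return [(x, y) for y in ys for x in xs]
```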

items_and_annotation_generator(query=None, use_mirrored_images=False)

Provides a generator of all DatasetItems and Annotations in the dataset.

Parameters:
  • query (Optional[str]) –

    Structured query compatible with the Nucleus query language.

  • use_mirrored_images (bool) – If True, returns the location of the mirrored image hosted in Scale S3. Useful when the original image is no longer available.

Returns:

Generator where each element is a dict containing the DatasetItem and all of its associated Annotations, grouped by type.

Iterable[{
    "item": DatasetItem,
    "annotations": {
        "box": List[BoxAnnotation],
        "polygon": List[PolygonAnnotation],
        "cuboid": List[CuboidAnnotation],
        "line": Optional[List[LineAnnotation]],
        "segmentation": List[SegmentationAnnotation],
        "category": List[CategoryAnnotation],
        "keypoints": List[KeypointsAnnotation],
    }
}]

Return type:

Iterable[Dict[str, Union[nucleus.dataset_item.DatasetItem, Dict[str, List[nucleus.annotation.Annotation]]]]]

items_and_annotations()

Returns a list of all DatasetItems and Annotations in this dataset.

Returns:

A list of dicts, each with two keys representing a row in the dataset:

List[{
    "item": DatasetItem,
    "annotations": {
        "box": Optional[List[BoxAnnotation]],
        "cuboid": Optional[List[CuboidAnnotation]],
        "line": Optional[List[LineAnnotation]],
        "polygon": Optional[List[PolygonAnnotation]],
        "segmentation": Optional[List[SegmentationAnnotation]],
        "category": Optional[List[CategoryAnnotation]],
        "keypoints": Optional[List[KeypointsAnnotation]],
    }
}]

Return type:

List[Dict[str, Union[nucleus.dataset_item.DatasetItem, Dict[str, List[nucleus.annotation.Annotation]]]]]

items_generator(page_size=100000)

Generator yielding all dataset items in the dataset.

collected_ref_ids = []
for item in dataset.items_generator():
    print(f"Exporting item: {item.reference_id}")
    collected_ref_ids.append(item.reference_id)
Parameters:

page_size (int, optional) – Number of items to return per page. If you are experiencing timeouts while using this generator, you can try lowering the page size.

Yields:

DatasetItem – A single DatasetItem object.

Return type:

Iterable[nucleus.dataset_item.DatasetItem]

jobs(job_types=None, from_date=None, to_date=None, limit=JOB_REQ_LIMIT, show_completed=False, stats_only=False)

Fetch jobs pertaining to this particular dataset.

Parameters:
  • job_types (Optional[List[nucleus.job.CustomerJobTypes]]) – Filter on set of job types; if None, fetch all types, e.g. ['uploadDatasetItems']

  • from_date (Optional[Union[str, datetime.datetime]]) – Beginning of date range, as a string 'YYYY-MM-DD' or a datetime object. For example: '2021-11-05', parser.parse('Nov 5 2021'), or datetime(2021, 11, 5)

  • to_date (Optional[Union[str, datetime.datetime]]) – end of date range

  • limit (int) – number of results to fetch, max 50_000

  • show_completed (bool) – Whether to also fetch jobs with Completed status

  • stats_only (bool) – return overview of jobs, instead of a list of job objects

list_autotags()

Fetches all autotags of the dataset.

Returns:

List of autotag payloads:

List[{
    "id": str,
    "name": str,
    "status": "completed" | "pending",
    "autotag_level": "Image" | "Object"
}]

loc(dataset_item_id)

Fetches a dataset item and associated annotations by Nucleus-generated ID.

Parameters:

dataset_item_id (str) – Nucleus-generated dataset item ID (starts with di_). This can be retrieved via Dataset.items() or a Nucleus dashboard URL.

Returns:

Payload containing the dataset item and associated annotations:

{
    "item": DatasetItem
    "annotations": {
        "box": Optional[List[BoxAnnotation]],
        "cuboid": Optional[List[CuboidAnnotation]],
        "line": Optional[List[LineAnnotation]],
        "polygon": Optional[List[PolygonAnnotation]],
        "keypoints": Optional[List[KeypointsAnnotation]],
        "segmentation": Optional[List[SegmentationAnnotation]],
        "category": Optional[List[CategoryAnnotation]],
    }
}

Return type:

dict

prediction_loc(model, reference_id, annotation_id)

Fetches a single model prediction by ID.

Parameters:
  • model (Model) – Model object from which to fetch the prediction.

  • reference_id (str) – User-defined reference ID of the dataset item associated with the model prediction.

  • annotation_id (str) – User-defined ID of the model prediction.

Returns:

Model prediction object with the specified annotation ID.

Return type:

Union[BoxPrediction, PolygonPrediction, CuboidPrediction, SegmentationPrediction, CategoryPrediction, KeypointsPrediction]

predictions_iloc(model, index)

Fetches all predictions of a dataset item by its absolute index.

Parameters:
  • model (Model) – Model object from which to fetch the prediction.

  • index (int) – Absolute index of the dataset item within the dataset.

Returns:

Dictionary mapping prediction type to a list of such prediction objects from the given model:

{
    "box": List[BoxPrediction],
    "polygon": List[PolygonPrediction],
    "cuboid": List[CuboidPrediction],
    "segmentation": List[SegmentationPrediction],
    "category": List[CategoryPrediction],
    "keypoints": List[KeypointsPrediction],
}

Return type:

Dict[str, List[Union[BoxPrediction, PolygonPrediction, CuboidPrediction, SegmentationPrediction, CategoryPrediction, KeypointsPrediction]]]
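A sketch of aggregating the returned mapping client-side, e.g. to count or filter predictions by confidence (hypothetical dicts stand in for the prediction objects; `confidence` mirrors the attribute on prediction types):

```python
# Hypothetical payload in the shape returned by Dataset.predictions_iloc()
predictions = {
    "box": [{"label": "car", "confidence": 0.9}],
    "polygon": [],
    "cuboid": [{"label": "truck", "confidence": 0.7}, {"label": "car", "confidence": 0.4}],
}

# Total predictions across all types
total = sum(len(preds) for preds in predictions.values())

# Flatten and keep only high-confidence predictions
confident = [
    p for preds in predictions.values() for p in preds
    if p.get("confidence", 0) >= 0.5
]
```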

predictions_refloc(model, reference_id)

Fetches all predictions of a dataset item by its reference ID.

Parameters:
  • model (Model) – Model object from which to fetch the prediction.

  • reference_id (str) – User-defined ID of the dataset item from which to fetch all predictions.

Returns:

Dictionary mapping prediction type to a list of such prediction objects from the given model:

{
    "box": List[BoxPrediction],
    "polygon": List[PolygonPrediction],
    "cuboid": List[CuboidPrediction],
    "segmentation": List[SegmentationPrediction],
    "category": List[CategoryPrediction],
    "keypoints": List[KeypointsPrediction],
}

Return type:

Dict[str, List[Union[BoxPrediction, PolygonPrediction, CuboidPrediction, SegmentationPrediction, CategoryPrediction, KeypointsPrediction]]]

query_items(query)

Fetches all DatasetItems that pertain to a given structured query.

Parameters:

query (str) –

Structured query compatible with the Nucleus query language.

Returns:

A list of DatasetItem query results.

Return type:

Iterable[nucleus.dataset_item.DatasetItem]
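A hedged sketch of building and running a query. The exact syntax is governed by the Nucleus query language, and `metadata.weather` is a hypothetical metadata field used only for illustration:

```python
# Build a structured query string; the field name is an assumption
field, value = "metadata.weather", "rainy"
query = f'{field} = "{value}"'

# Results are fetched lazily as you iterate (requires a live Dataset):
# for item in dataset.query_items(query):
#     print(item.reference_id)
```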

query_objects(query, query_type, model_run_id=None)

Fetches all objects in the dataset that pertain to a given structured query. The results are either Predictions, Annotations, or Evaluation Matches, depending on the query_type parameter.

Parameters:
  • query (str) –

    Structured query compatible with the Nucleus query language.

  • query_type (ObjectQueryType) – Defines the type of object to query.

  • model_run_id (Optional[str])

Returns:

An iterable of either Predictions, Annotations, or Evaluation Matches

Return type:

Iterable[Union[nucleus.annotation.Annotation, nucleus.prediction.Prediction, nucleus.evaluation_match.EvaluationMatch]]

query_scenes(query)

Fetches all Scenes that pertain to a given structured query.

Parameters:

query (str) –

Structured query compatible with the Nucleus query language.

Returns:

A list of Scene query results.

Return type:

Iterable[nucleus.scene.Scene]

refloc(reference_id)

Fetches a dataset item and associated annotations by reference ID.

Parameters:

reference_id (str) – User-defined reference ID of the dataset item.

Returns:

Payload containing the dataset item and associated annotations:

{
    "item": DatasetItem,
    "annotations": {
        "box": Optional[List[BoxAnnotation]],
        "cuboid": Optional[List[CuboidAnnotation]],
        "line": Optional[List[LineAnnotation]],
        "polygon": Optional[List[PolygonAnnotation]],
        "keypoints": Optional[List[KeypointsAnnotation]],
        "segmentation": Optional[List[SegmentationAnnotation]],
        "category": Optional[List[CategoryAnnotation]],
    }
}

Return type:

dict

set_continuous_indexing(enable=True)

Toggle whether embeddings are automatically generated for new data.

Sets continuous indexing for a given dataset, which will automatically generate embeddings for use with autotag whenever new images are uploaded.

Parameters:

enable (bool) – Whether to enable or disable continuous indexing. Default is True.

Returns:

Response payload:

{
    "dataset_id": str,
    "message": str,
    "backfill_job": AsyncJob,
}

set_primary_index(image=True, custom=False)

Sets the primary index used for Autotag and Similarity Search on this dataset.

Parameters:
  • image (bool) – Whether to configure the primary index for images or objects. Default is True (set primary image index).

  • custom (bool) – Whether to set the primary index to use custom or Nucleus-generated embeddings. Default is False (use Nucleus-generated embeddings as the primary index).

Returns:

{
    "success": bool,
}

update_autotag(autotag_id)

Rerun autotag inference on all items in the dataset.

Currently this endpoint does not skip items that have already been inferred on, though this improvement is planned. For now, you can only have one such job running at a time, so please await the result using job.sleep_until_complete() before launching another job.

Parameters:

autotag_id (str) – ID of the autotag to re-inference. You can retrieve the ID you want with list_autotags(), or from its URL in the “Manage Autotags” page in the dashboard.

Returns:

Asynchronous job object to track processing status.

Return type:

AsyncJob

update_item_metadata(mapping, asynchronous=False)

Update (merge) dataset item metadata for each reference_id given in the mapping. The backend will join the specified mapping metadata to the existing metadata. If there is a key-collision, the value given in the mapping will take precedence.

This method may also be used to update the camera_params for a particular set of items. Just specify the key camera_params in the metadata for each reference_id, along with all the necessary fields.

Parameters:
  • mapping (Dict[str, dict]) – key-value pair of <reference_id>: <metadata>

  • asynchronous (bool) – if True, run the update as a background job

Examples

>>> mapping = {"item_ref_1": {"new_key": "foo"}, "item_ref_2": {"some_value": 123, "camera_params": {...}}}
>>> dataset.update_item_metadata(mapping)
Returns:

A dictionary outlining success or failures.

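The merge semantics described above can be sketched locally with plain dicts (simulating, not calling, the backend join; the reference IDs are hypothetical):

```python
# Simulate the server-side merge: mapping metadata is joined onto existing
# metadata, and on key collision the mapping value takes precedence.
existing = {"item_ref_1": {"new_key": "old", "keep": 1}}
mapping = {"item_ref_1": {"new_key": "foo"}, "item_ref_2": {"some_value": 123}}

merged = {
    ref: {**existing.get(ref, {}), **meta}
    for ref, meta in mapping.items()
}
# merged["item_ref_1"] keeps "keep" but "new_key" is overwritten to "foo"
```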

update_scene_metadata(mapping, asynchronous=False)

Update (merge) scene metadata for each reference_id given in the mapping. The backend will join the specified mapping metadata to the existing metadata. If there is a key-collision, the value given in the mapping will take precedence.

Parameters:
  • mapping (Dict[str, dict]) – key-value pair of <reference_id>: <metadata>

  • asynchronous (bool) – if True, run the update as a background job

Examples

>>> mapping = {"scene_ref_1": {"new_key": "foo"}, "scene_ref_2": {"some_value": 123}}
>>> dataset.update_scene_metadata(mapping)
Returns:

A dictionary outlining success or failures.


upload_lidar_semseg_predictions(model, pointcloud_ref_id, predictions_s3_path)

Upload Lidar Semantic Segmentation predictions for a given point-cloud.

Assuming a point-cloud with only 4 points (three predicted as Car, one predicted as Person), the contents of the predictions S3 object should be formatted as follows:

{
    "objects": [
        { "label": "Car", "index": 1},
        { "label": "Person", "index": 2}
    ],
    "point_objects": [1, 1, 1, 2],
    "point_confidence": [0.5, 0.9, 0.9, 0.3]
}

The points in “point_objects” should be in the same order as the points that were originally uploaded to Scale.
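A hedged sketch of building and sanity-checking that object before writing it to S3 (the point count and labels mirror the example above; the checks themselves are a suggested precaution, not part of the API):

```python
import json

# Predictions object for a 4-point cloud, in the format described above
payload = {
    "objects": [
        {"label": "Car", "index": 1},
        {"label": "Person", "index": 2},
    ],
    "point_objects": [1, 1, 1, 2],
    "point_confidence": [0.5, 0.9, 0.9, 0.3],
}

num_points = 4  # must match the originally uploaded point count
assert len(payload["point_objects"]) == num_points
assert len(payload["point_confidence"]) == num_points

# Every per-point index must refer to a declared object
valid_indices = {obj["index"] for obj in payload["objects"]}
assert set(payload["point_objects"]) <= valid_indices

body = json.dumps(payload)  # contents to store at predictions_s3_path
```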

Parameters:
  • model (Model) – Nucleus model used to store these predictions

  • pointcloud_ref_id (str) – The reference ID of the pointcloud to which these predictions belong

  • predictions_s3_path (str) – S3 path to where the predictions are stored

upload_predictions(model, predictions, update=False, asynchronous=False, batch_size=5000, remote_files_per_upload_request=20, local_files_per_upload_request=10, trained_slice_id=None)

Uploads predictions and associates them with an existing Model.

Adding predictions to your dataset in Nucleus allows you to visualize discrepancies against ground truth, query dataset items based on the predictions they contain, and evaluate your models by comparing their predictions to ground truth.

Nucleus supports Box, Polygon, Cuboid, Segmentation, Category, and SceneCategory predictions. Cuboid predictions can only be uploaded to a pointcloud DatasetItem.

When uploading a prediction, you need to specify which item you are annotating via the reference_id you provided when uploading the image or pointcloud.

Prediction uploads can be made idempotent by specifying an optional annotation_id for each prediction. This ID should be unique within the dataset item so that (reference_id, annotation_id) is unique within the dataset.

See SegmentationPrediction for specific requirements to upload segmentation predictions.

For ingesting large prediction payloads, see the Guide for Large Ingestions.

Parameters:
  • model (Model) – Model object with which to associate the predictions. Models can be retrieved via list_models() or from a Nucleus dashboard URL.

  • predictions (List[Union[BoxPrediction, PolygonPrediction, CuboidPrediction, SegmentationPrediction, CategoryPrediction, SceneCategoryPrediction]]) – List of prediction objects to upload.

  • update (bool) – Whether or not to overwrite metadata or ignore on reference ID collision. Default is False.

  • asynchronous (bool) – Whether or not to process the upload asynchronously (and return an AsyncJob object). Default is False.

  • batch_size (int) – Number of predictions processed in each concurrent batch. Default is 5000. If you get timeouts when uploading geometric predictions, try lowering this batch size. Only relevant for asynchronous=False.

  • remote_files_per_upload_request (int) – Number of remote files to upload in each request. Segmentations have either local or remote files; if you are getting timeouts while uploading segmentations with remote URLs, lower this value from its default of 20. Only relevant for asynchronous=False.

  • local_files_per_upload_request (int) – Number of local files to upload in each request. Segmentations have either local or remote files; if you are getting timeouts while uploading segmentations with local files, lower this value from its default of 10. The maximum is 10. Only relevant for asynchronous=False.

  • trained_slice_id (Optional[str]) – Nucleus-generated slice ID (starts with slc_) which was used to train the model.

Returns:

Payload describing the synchronous upload:

{
    "dataset_id": str,
    "model_run_id": str,
    "predictions_processed": int,
    "predictions_ignored": int,
}
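Since idempotency hinges on (reference_id, annotation_id) pairs being unique, a hedged pre-upload check can catch duplicates locally (plain dicts stand in for prediction objects; the actual upload call is shown commented out):

```python
# Hypothetical predictions, each identified by (reference_id, annotation_id)
predictions = [
    {"reference_id": "img_001", "annotation_id": "pred_1", "label": "car"},
    {"reference_id": "img_001", "annotation_id": "pred_2", "label": "person"},
    {"reference_id": "img_002", "annotation_id": "pred_1", "label": "car"},
]

# Duplicate keys would make re-uploads ambiguous rather than idempotent
keys = [(p["reference_id"], p["annotation_id"]) for p in predictions]
duplicates = len(keys) - len(set(keys))
assert duplicates == 0, "duplicate (reference_id, annotation_id) pairs"

# job = dataset.upload_predictions(model, predictions, asynchronous=True)
```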

class nucleus.dataset.ObjectQueryType

A str-based Enum specifying which object type query_objects() returns: model predictions, ground truth annotations, or evaluation matches.
