nucleus.metrics.categorization_metrics

CategorizationF1

Evaluation method that matches categories and returns a CategorizationF1Result that aggregates to the F1 score

CategorizationMetric

Abstract class for metrics related to Categorization

CategorizationResult

Base MetricResult class

class nucleus.metrics.categorization_metrics.CategorizationF1(confidence_threshold=0.0, f1_method='macro', annotation_filters=None, prediction_filters=None)

Evaluation method that matches categories and returns a CategorizationF1Result that aggregates to the F1 score

Parameters:
  • confidence_threshold (float) – minimum confidence threshold for predictions to be taken into account for evaluation. Must be in [0, 1]. Default 0.0

  • f1_method (str) – {'micro', 'macro', 'samples', 'weighted', 'binary'}, default='macro'. This parameter is required for multiclass/multilabel targets. If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data:

  • 'binary' – Only report results for the class specified by pos_label. This is applicable only if targets (y_{true,pred}) are binary.

  • 'micro' – Calculate metrics globally by counting the total true positives, false negatives and false positives.

  • 'macro' – Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.

  • 'weighted' – Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.

  • 'samples' – Calculate metrics for each instance, and find their average (only meaningful for multilabel classification where this differs from accuracy_score()).

  • annotation_filters (Optional[Union[nucleus.metrics.filtering.ListOfOrAndFilters, nucleus.metrics.filtering.ListOfAndFilters]]) –

    Filter predicates. Allowed formats are: ListOfAndFilters where each Filter forms a chain of AND predicates.

    or

    ListOfOrAndFilters where Filters are expressed in disjunctive normal form (DNF), like [[MetadataFilter("short_haired", "==", True), FieldFilter("label", "in", ["cat", "dog"])], …]. DNF allows arbitrary boolean logical combinations of single field predicates. The innermost structures each describe a single column predicate. The list of inner predicates is interpreted as a conjunction (AND), forming a more selective, multi-field predicate. Finally, the outermost list combines these filters as a disjunction (OR).

  • prediction_filters (Optional[Union[nucleus.metrics.filtering.ListOfOrAndFilters, nucleus.metrics.filtering.ListOfAndFilters]]) –

    Filter predicates. Allowed formats are: ListOfAndFilters where each Filter forms a chain of AND predicates.

    or

    ListOfOrAndFilters where Filters are expressed in disjunctive normal form (DNF), like [[MetadataFilter("short_haired", "==", True), FieldFilter("label", "in", ["cat", "dog"])], …]. DNF allows arbitrary boolean logical combinations of single field predicates. The innermost structures each describe a single column predicate. The list of inner predicates is interpreted as a conjunction (AND), forming a more selective, multi-field predicate. Finally, the outermost list combines these filters as a disjunction (OR).
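To make the f1_method averaging options concrete, here is a minimal pure-Python sketch of the 'micro', 'macro' and 'weighted' averages for single-label categorization. It is not the library's implementation; the helper names (per_label_counts, f1, averaged_f1) are hypothetical and exist only for this illustration.

```python
from collections import Counter


def per_label_counts(y_true, y_pred):
    """Count true positives, false positives and false negatives per label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # label p was predicted but is wrong
            fn[t] += 1  # label t was the truth but was missed
    return tp, fp, fn


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def averaged_f1(y_true, y_pred, method="macro"):
    """Hypothetical helper: average per-label F1 according to `method`."""
    tp, fp, fn = per_label_counts(y_true, y_pred)
    labels = sorted(set(y_true) | set(y_pred))
    if method == "micro":
        # Pool all counts globally, then compute a single F1.
        return f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    scores = {label: f1(tp[label], fp[label], fn[label]) for label in labels}
    if method == "macro":
        # Unweighted mean over labels; ignores label imbalance.
        return sum(scores.values()) / len(labels)
    if method == "weighted":
        # Mean over labels weighted by support (true instances per label).
        support = Counter(y_true)
        return sum(scores[label] * support[label] for label in labels) / len(y_true)
    raise ValueError(f"unsupported method in this sketch: {method!r}")


y_true = ["cat", "cat", "cat", "dog"]
y_pred = ["cat", "cat", "dog", "dog"]
for method in ("micro", "macro", "weighted"):
    print(method, round(averaged_f1(y_true, y_pred, method), 3))
# micro 0.75, macro 0.733, weighted 0.767
```

With three "cat" items and one "dog" item, the weighted average moves toward the score of the more frequent label, while the macro average treats both labels equally.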

aggregate_score(results)

A metric must define how to aggregate results from single items to a single ScalarResult.

E.g. to calculate an R2 score with sklearn you could define a custom metric class

class R2Result(MetricResult):
    y_true: float
    y_pred: float

And then define an aggregate_score

from typing import List

import sklearn.metrics

def aggregate_score(self, results: List[MetricResult]) -> ScalarResult:
    # Collect the per-item values, then score the whole collection at once.
    y_trues = []
    y_preds = []
    for result in results:
        y_trues.append(result.y_true)
        y_preds.append(result.y_pred)
    r2_score = sklearn.metrics.r2_score(y_trues, y_preds)
    return ScalarResult(r2_score)

Parameters:

results (List[CategorizationResult])

Return type:

nucleus.metrics.base.ScalarResult

call_metric(annotations, predictions)

A metric must override this method and return a metric result, given annotations and predictions.

Parameters:
  • annotations

  • predictions

Return type:

CategorizationResult

eval(annotations, predictions)

Note: this eval function is somewhat unusual. It essentially only matches annotations to labels; the actual metric computation happens in the aggregate step, since an F1 score only makes sense on a collection.

Parameters:
  • annotations

  • predictions

Return type:

CategorizationResult
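The split between eval and aggregate_score described above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: ItemResult, eval_item and aggregate_macro_f1 are hypothetical names, and labels are plain strings rather than annotation or prediction objects.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ItemResult:
    """Hypothetical per-item result: the matched labels only, no score yet."""
    true_label: str
    predicted_label: str


def eval_item(annotation_label: str, prediction_label: str) -> ItemResult:
    # Phase 1: pair the ground-truth label with the predicted label.
    return ItemResult(annotation_label, prediction_label)


def aggregate_macro_f1(results: List[ItemResult]) -> float:
    # Phase 2: compute F1 once, over the whole collection of item results.
    labels = sorted(
        {r.true_label for r in results} | {r.predicted_label for r in results}
    )
    scores = []
    for label in labels:
        tp = sum(r.true_label == label and r.predicted_label == label for r in results)
        fp = sum(r.true_label != label and r.predicted_label == label for r in results)
        fn = sum(r.true_label == label and r.predicted_label != label for r in results)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


pairs = [("cat", "cat"), ("cat", "dog"), ("dog", "dog")]
results = [eval_item(truth, pred) for truth, pred in pairs]
score = aggregate_macro_f1(results)
print(round(score, 3))  # 0.667
```

A single item carries only a (truth, prediction) pair, so no meaningful F1 exists until the pairs are pooled in the aggregate step.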

class nucleus.metrics.categorization_metrics.CategorizationMetric(confidence_threshold=0.0, annotation_filters=None, prediction_filters=None)

Abstract class for metrics related to Categorization

The CategorizationMetric class automatically filters incoming annotations and predictions so that only categorization annotations remain. It also filters out predictions whose confidence is less than the provided confidence_threshold.
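As a rough illustration of the confidence filtering described above, the following sketch keeps only predictions at or above the threshold. The Pred dataclass and filter_by_confidence helper are hypothetical stand-ins, not nucleus classes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Pred:
    """Hypothetical stand-in for a categorization prediction."""
    label: str
    confidence: float


def filter_by_confidence(predictions: List[Pred], confidence_threshold: float = 0.0) -> List[Pred]:
    """Drop predictions whose confidence is below the threshold."""
    if not 0.0 <= confidence_threshold <= 1.0:
        raise ValueError("confidence_threshold must be in [0, 1]")
    return [p for p in predictions if p.confidence >= confidence_threshold]


preds = [Pred("cat", 0.9), Pred("dog", 0.4), Pred("cat", 0.5)]
kept = filter_by_confidence(preds, confidence_threshold=0.5)
print([p.label for p in kept])  # ['cat', 'cat']
```

With the default threshold of 0.0, every prediction passes the filter.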

Initializes CategorizationMetric abstract object.

Parameters:
  • confidence_threshold (float) – minimum confidence threshold for predictions to be taken into account for evaluation. Must be in [0, 1]. Default 0.0

  • annotation_filters (Optional[Union[nucleus.metrics.filtering.ListOfOrAndFilters, nucleus.metrics.filtering.ListOfAndFilters]]) –

    Filter predicates. Allowed formats are: ListOfAndFilters where each Filter forms a chain of AND predicates.

    or

    ListOfOrAndFilters where Filters are expressed in disjunctive normal form (DNF), like [[MetadataFilter("short_haired", "==", True), FieldFilter("label", "in", ["cat", "dog"])], …]. DNF allows arbitrary boolean logical combinations of single field predicates. The innermost structures each describe a single column predicate. The list of inner predicates is interpreted as a conjunction (AND), forming a more selective, multi-field predicate. Finally, the outermost list combines these filters as a disjunction (OR).

  • prediction_filters (Optional[Union[nucleus.metrics.filtering.ListOfOrAndFilters, nucleus.metrics.filtering.ListOfAndFilters]]) –

    Filter predicates. Allowed formats are: ListOfAndFilters where each Filter forms a chain of AND predicates.

    or

    ListOfOrAndFilters where Filters are expressed in disjunctive normal form (DNF), like [[MetadataFilter("short_haired", "==", True), FieldFilter("label", "in", ["cat", "dog"])], …]. DNF allows arbitrary boolean logical combinations of single field predicates. The innermost structures each describe a single column predicate. The list of inner predicates is interpreted as a conjunction (AND), forming a more selective, multi-field predicate. Finally, the outermost list combines these filters as a disjunction (OR).
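The DNF semantics of the filters (an outer OR over inner AND groups) can be sketched as follows. This is only an illustration of the logic: Predicate, matches and passes_dnf are hypothetical names, items are plain dictionaries, and treating an empty filter list as "keep everything" is an assumption of the sketch rather than documented behavior.

```python
import operator
from typing import Any, Dict, List, NamedTuple


class Predicate(NamedTuple):
    """Hypothetical single-field predicate (not the nucleus filter classes)."""
    key: str
    op: str
    value: Any


OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    "in": lambda field, allowed: field in allowed,
}


def matches(item: Dict[str, Any], pred: Predicate) -> bool:
    # A predicate on a missing field never matches in this sketch.
    if pred.key not in item:
        return False
    return OPS[pred.op](item[pred.key], pred.value)


def passes_dnf(item: Dict[str, Any], or_and_filters: List[List[Predicate]]) -> bool:
    # Assumption of this sketch: no filters means nothing is excluded.
    if not or_and_filters:
        return True
    # Outer list: OR over groups. Inner list: AND over predicates.
    return any(all(matches(item, p) for p in group) for group in or_and_filters)


filters = [
    [Predicate("short_haired", "==", True), Predicate("label", "in", ["cat", "dog"])],
    [Predicate("label", "==", "bird")],
]
print(passes_dnf({"label": "cat", "short_haired": True}, filters))    # True
print(passes_dnf({"label": "cat", "short_haired": False}, filters))   # False
print(passes_dnf({"label": "bird", "short_haired": False}, filters))  # True
```

An item passes if it satisfies every predicate in at least one inner group, which is why the long-haired cat is rejected while the bird is accepted through the second group.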

abstract aggregate_score(results)

A metric must define how to aggregate results from single items to a single ScalarResult.

E.g. to calculate an R2 score with sklearn you could define a custom metric class

class R2Result(MetricResult):
    y_true: float
    y_pred: float

And then define an aggregate_score

from typing import List

import sklearn.metrics

def aggregate_score(self, results: List[MetricResult]) -> ScalarResult:
    # Collect the per-item values, then score the whole collection at once.
    y_trues = []
    y_preds = []
    for result in results:
        y_trues.append(result.y_true)
        y_preds.append(result.y_pred)
    r2_score = sklearn.metrics.r2_score(y_trues, y_preds)
    return ScalarResult(r2_score)

Parameters:

results (List[CategorizationResult])

Return type:

nucleus.metrics.base.ScalarResult

call_metric(annotations, predictions)

A metric must override this method and return a metric result, given annotations and predictions.

Parameters:
  • annotations

  • predictions

Return type:

CategorizationResult

class nucleus.metrics.categorization_metrics.CategorizationResult

Base MetricResult class