
Dataset Analysis

The Dataset Analysis page lets users explore a dataset through statistical insights and visual analysis tools. It helps uncover class imbalances, duplicates, noisy labels, and distribution anomalies, all of which matter for dataset validation and preparation.


The page includes a top menu and per-tab controls:

  • A dataset selector dropdown to switch between datasets.
  • A shortcut button to view dataset details.
  • Pagination and filtering controls in each tab for efficient navigation.

Tabs Overview

1. Summary Stats

  • Displays total counts for:
    • Images
    • Annotations
    • Metadata records
    • Linked validation models
  • Shows the number of outliers, noisy labels, and unique images
  • Class Mappings: Pie or donut chart for label distribution (e.g., 'Alive' vs 'Dead')
  • Metadata Distribution: Sunburst chart illustrating fields like age or COVID-19 status
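
The label distribution behind a chart like Class Mappings can be reproduced offline to check for class imbalance. The following is a minimal sketch, assuming the labels are available as a plain list of strings; the function names `class_distribution` and `imbalance_ratio` are illustrative and are not part of the product.

```python
from collections import Counter


def class_distribution(labels):
    """Return {label: (count, share_of_total)} for a list of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.items()}


def imbalance_ratio(labels):
    """Ratio of the most frequent class count to the least frequent one."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Hypothetical labels, reusing the 'Alive' / 'Dead' class names from the example.
labels = ["Alive"] * 8 + ["Dead"] * 2
distribution = class_distribution(labels)
ratio = imbalance_ratio(labels)

for label, (count, share) in distribution.items():
    print(f"{label}: {count} images ({share:.0%})")
print(f"imbalance ratio: {ratio:.1f}")
```

A ratio close to 1 indicates balanced classes; larger values mean one class dominates.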


2. Image Similarity

  • Clusters visually similar images
  • Grid view of groups with similarity scores
  • Allows users to:
    • Identify potential duplicates
    • Explore similarity-based edge cases
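
How the page computes similarity scores is not documented here. As one possible illustration, the sketch below uses an average hash (a simple perceptual hash) to score pairs of grayscale images and list likely duplicates. The image representation (lists of rows of 0-255 integers), the function names, and the 0.9 threshold are all assumptions made for this example.

```python
def average_hash(pixels, hash_size=4):
    """Average hash of a grayscale image given as a list of rows of 0-255 ints.

    The image is reduced to hash_size x hash_size block means, and each block
    becomes 1 if it is brighter than the mean of all blocks, otherwise 0.
    The image must be at least hash_size pixels in each dimension.
    """
    height, width = len(pixels), len(pixels[0])
    blocks = []
    for by in range(hash_size):
        y0, y1 = by * height // hash_size, (by + 1) * height // hash_size
        for bx in range(hash_size):
            x0, x1 = bx * width // hash_size, (bx + 1) * width // hash_size
            values = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            blocks.append(sum(values) / len(values))
    mean = sum(blocks) / len(blocks)
    return [1 if block > mean else 0 for block in blocks]


def similarity(hash_a, hash_b):
    """Fraction of matching bits; 1.0 means the hashes are identical."""
    matches = sum(a == b for a, b in zip(hash_a, hash_b))
    return matches / len(hash_a)


def find_near_duplicates(images, threshold=0.9):
    """Return (name_a, name_b, score) for every pair scoring at or above threshold."""
    hashes = {name: average_hash(px) for name, px in images.items()}
    names = sorted(hashes)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            score = similarity(hashes[a], hashes[b])
            if score >= threshold:
                pairs.append((a, b, score))
    return pairs


# Hypothetical 8x8 test images: a horizontal gradient, a slightly brighter
# copy of it, and its inverse.
gradient = [[x * 32 for x in range(8)] for _ in range(8)]
images = {
    "gradient": gradient,
    "gradient_brighter": [[v + 5 for v in row] for row in gradient],
    "inverted": [[255 - v for v in row] for row in gradient],
}
pairs = find_near_duplicates(images)
print(pairs)
```

The average hash ignores uniform brightness shifts, so the brighter copy is reported as a duplicate of the original while the inverted image is not.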


3. Noisy Labels

  • Highlights images with inconsistent or suspicious labels
  • Shows count of noisy images
  • If none detected, the UI displays an informative placeholder
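
The detection method behind this tab is not described. One common heuristic, shown here only as a sketch, flags a sample when its label disagrees with the majority label of its nearest neighbours in some feature space. The feature vectors, the value of `k`, and the function name `flag_suspicious_labels` are assumptions for illustration.

```python
import math
from collections import Counter


def flag_suspicious_labels(features, labels, k=3):
    """Return indices whose label differs from the majority of their k nearest neighbours."""
    flagged = []
    for i, point in enumerate(features):
        # Rank all other samples by Euclidean distance and keep the k closest.
        neighbours = sorted(
            (j for j in range(len(features)) if j != i),
            key=lambda j: math.dist(point, features[j]),
        )[:k]
        majority, _ = Counter(labels[j] for j in neighbours).most_common(1)[0]
        if majority != labels[i]:
            flagged.append(i)
    return flagged


# Hypothetical 2-D features forming two well-separated clusters.
features = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
# Index 3 sits in the first cluster but carries the second cluster's label.
labels = ["Alive", "Alive", "Alive", "Dead", "Dead", "Dead", "Dead", "Dead"]

noisy = flag_suspicious_labels(features, labels)
print(f"{len(noisy)} suspicious label(s): {noisy}")
```

Flagged samples are candidates for manual review, not confirmed errors.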


4. Outlier Images

  • Detects images statistically distant from dataset norms
  • Each image includes:
    • Internal ID
    • Distance metric value
  • Total outlier count is displayed
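
The exact distance metric used by the page is not stated. As a minimal sketch of the general idea, the code below measures each image's Euclidean distance from the dataset centroid in a feature space and reports images whose distance is unusually large. The feature vectors, the `img_NNN` identifiers, and the z-score threshold of 2.0 are assumptions for this example.

```python
import math
import statistics


def centroid_distances(features):
    """Euclidean distance of each feature vector from the dataset centroid."""
    dims = len(features[0])
    centroid = [statistics.fmean(vec[d] for vec in features) for d in range(dims)]
    return [math.dist(vec, centroid) for vec in features]


def find_outliers(features, ids, z_threshold=2.0):
    """Return (image_id, distance) for images whose distance z-score exceeds the threshold."""
    distances = centroid_distances(features)
    mean = statistics.fmean(distances)
    spread = statistics.pstdev(distances)
    if spread == 0:
        return []  # all images are equally far from the centroid
    return [
        (image_id, dist)
        for image_id, dist in zip(ids, distances)
        if (dist - mean) / spread > z_threshold
    ]


# Hypothetical 2-D features: nine points near the origin and one far away.
features = [
    (1, 0), (-1, 0), (0, 1), (0, -1), (1, 1),
    (-1, -1), (1, -1), (-1, 1), (0, 0), (20, 20),
]
ids = [f"img_{i:03d}" for i in range(len(features))]

outliers = find_outliers(features, ids)
print(f"total outliers: {len(outliers)}")
for image_id, dist in outliers:
    print(f"{image_id}: distance {dist:.2f}")
```

Because the centroid itself is pulled toward extreme points, a robust alternative such as the median may be preferable on small datasets.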


5. Unique Images

  • Categorizes standout images based on brightness, size, and pixel distribution:
    • Brightest / Darkest
    • Largest / Smallest
    • Largest / Smallest Mean
    • Largest / Smallest Standard Deviation (STDV)
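
Rankings like these can be approximated from simple per-image statistics. The sketch below assumes grayscale images stored as lists of rows of 0-255 integers, treats "size" as the pixel count, and uses mean intensity and population standard deviation for the pixel distribution; how the page itself defines each category is not documented, so these definitions are assumptions.

```python
import statistics


def describe(pixels):
    """Per-image statistics: pixel count, mean intensity, and standard deviation."""
    flat = [value for row in pixels for value in row]
    return {
        "size": len(flat),
        "mean": statistics.fmean(flat),
        "std": statistics.pstdev(flat),
    }


def extremes(images, key):
    """Return (name_with_largest, name_with_smallest) value of the given statistic."""
    stats = {name: describe(pixels)[key] for name, pixels in images.items()}
    return max(stats, key=stats.get), min(stats, key=stats.get)


# Hypothetical images: a small dark one, a large uniformly bright one,
# and a high-variance checker pattern.
images = {
    "dark_small": [[10, 12], [10, 12]],
    "bright_large": [[240] * 4 for _ in range(4)],
    "checker": [[0, 255, 0, 255], [255, 0, 255, 0]],
}

for key in ("size", "mean", "std"):
    largest, smallest = extremes(images, key)
    print(f"{key}: largest={largest}, smallest={smallest}")
```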


6. Image Statistics

  • Graphical overview of pixel-level statistics:
    • Blur
    • Brightness
    • Contrast
    • Others (selectable from dropdown)
  • A line chart shows the distribution of the selected statistic across the dataset
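
The formulas behind these statistics are not given on the page. A minimal sketch using widely used definitions follows: brightness as mean intensity, contrast as the standard deviation of intensities (RMS contrast), and blur as the variance of a 4-neighbour Laplacian, where lower values suggest a blurrier image. The grayscale list-of-rows representation and the function names are assumptions.

```python
import statistics


def brightness(pixels):
    """Mean pixel intensity."""
    return statistics.fmean([value for row in pixels for value in row])


def contrast(pixels):
    """RMS contrast: population standard deviation of pixel intensities."""
    return statistics.pstdev([value for row in pixels for value in row])


def laplacian_variance(pixels):
    """Variance of the 4-neighbour Laplacian over interior pixels.

    Sharp edges produce large responses, so a low variance suggests blur.
    The image must be at least 3x3 pixels.
    """
    height, width = len(pixels), len(pixels[0])
    responses = [
        pixels[y - 1][x] + pixels[y + 1][x] + pixels[y][x - 1] + pixels[y][x + 1]
        - 4 * pixels[y][x]
        for y in range(1, height - 1)
        for x in range(1, width - 1)
    ]
    return statistics.pvariance(responses)


# Hypothetical 4x4 images: a featureless gray image and a sharp checkerboard.
flat = [[128] * 4 for _ in range(4)]
checker = [[255 if (x + y) % 2 == 0 else 0 for x in range(4)] for y in range(4)]

for name, image in (("flat", flat), ("checker", checker)):
    print(
        f"{name}: brightness={brightness(image):.1f}, "
        f"contrast={contrast(image):.1f}, "
        f"laplacian_variance={laplacian_variance(image):.0f}"
    )
```

Computing these values for every image and plotting their histogram gives a dataset-level distribution comparable to the chart described above.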


Why Use This?

  • Ensure data quality before model training
  • Spot anomalies and cleaning opportunities
  • Visualize label balance and image metadata spread
  • Evaluate dataset uniqueness to reduce the risk of overfitting

This toolset is critical for validating medical datasets and preparing them for model training.