Showing 702 open source projects for "dataset"

View related business solutions
  • Catch Bugs Before Your Customers Do Icon
    Catch Bugs Before Your Customers Do

    Real-time error alerts, performance insights, and anomaly detection across your full stack. Free 30-day trial.

    Move from alert to fix before users notice. AppSignal monitors errors, performance bottlenecks, host health, and uptime—all from one dashboard. Instant notifications on deployments, anomaly triggers for memory spikes or error surges, and seamless log management. Works out of the box with Rails, Django, Express, Phoenix, Next.js, and dozens more. Starts at $23/month with no hidden fees.
    Try AppSignal Free
  • Gemini 3 and 200+ AI Models on One Platform Icon
    Gemini 3 and 200+ AI Models on One Platform

    Access Google's best plus Claude, Llama, and Gemma. Fine-tune and deploy from one console.

    Build generative AI apps with Vertex AI. Switch between models without switching platforms.
    Start Free
  • 1
    Easy DataSet

    Easy DataSet

    A powerful tool for creating datasets for LLM fine-tuning

    ...The system includes automated question-generation capabilities, hierarchical label trees, and answer generation pipelines that use LLM APIs to produce coherent paired data with customizable templates. Beyond dataset creation, Easy-dataset also provides a built-in evaluation system with model testing and blind-test features, helping teams validate model performance using curated test sets.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 2
    Mathematics Dataset

    Mathematics Dataset

    This dataset code generates mathematical question and answer pairs

    The Mathematics Dataset, developed by Google DeepMind, is a synthetic dataset designed to evaluate and train machine learning models on mathematical reasoning and symbolic manipulation. It generates question-and-answer pairs across a wide range of mathematical topics typically found in school-level curricula, testing a model’s ability to reason about algebra, arithmetic, calculus, probability, and more.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 3
    The Hypersim Dataset

    The Hypersim Dataset

    Photorealistic Synthetic Dataset for Holistic Indoor Scene

    Hypersim is a large-scale, photorealistic synthetic dataset and tooling suite for indoor scene understanding research. It provides richly annotated renderings—RGB, depth, surface normals, instance and semantic segmentations, and material/lighting metadata—produced from high-fidelity virtual environments. The dataset spans diverse furniture layouts, room types, and camera trajectories, enabling robust training for geometry, segmentation, and SLAM-adjacent tasks.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 4
    The Unsplash Dataset

    The Unsplash Dataset

    Unsplash images made available for research and machine learning

    The Unsplash Dataset is made up of over 350,000+ contributing global photographers and data sourced from hundreds of millions of searches across a nearly unlimited number of uses and contexts. Due to the breadth of intent and semantics contained within the Unsplash dataset, it enables new opportunities for research and learning.
    Downloads: 0 This Week
    Last Update:
    See Project
  • Our Free Plans just got better! | Auth0 Icon
    Our Free Plans just got better! | Auth0

    With up to 25k MAUs and unlimited Okta connections, our Free Plan lets you focus on what you do best—building great apps.

    You asked, we delivered! Auth0 is excited to expand our Free and Paid plans to include more options so you can focus on building, deploying, and scaling applications without having to worry about your security. Auth0 now, thank yourself later.
    Try free now
  • 5
    DataSet Serialize for Delphi and Lazarus

    DataSet Serialize for Delphi and Lazarus

    JSON to DataSet and DataSet to JSON converter for Delphi and Lazarus

    DataSet Serialize is a set of features to make working with JSON and DataSet simple. It has features such as exporting or importing records into a DataSet, validate if JSON has all required attributes (previously entered in the DataSet), exporting or importing the structure of DataSet fields in JSON format. In addition to managing nested JSON through master detail or using TDataSetField (you choose the way that suits you best).
    Downloads: 0 This Week
    Last Update:
    See Project
  • 6
    Passport Index Dataset

    Passport Index Dataset

    Passport Index 2023: visa requirements for 199 countries, in .csv

    There are 6 datasets with identical visa requirements data. Three datasets are matrix and three are long (tidy) formats. Each comes in 3 versions: with country codes as specified in ISO-2 (two-letter codes), ISO-3 (three-letter codes), and full country names from no particular standard. In distance matrices (files with matrix in the filename), the first column represents a passport (=from), each remaining column represents a destination (=to). Files in tidy format (with tidy in filename)...
    Downloads: 2 This Week
    Last Update:
    See Project
  • 7
    Exclusively Dark Image Dataset

    Exclusively Dark Image Dataset

    ExDARK dataset is the largest collection of low-light images

    The Exclusively Dark (ExDARK) dataset is one of the largest curated collections of real-world low-light images designed to support research in computer vision tasks under challenging lighting conditions. It contains 7,363 images captured across ten different low-light scenarios, ranging from extremely dark environments to twilight. Each image is annotated with both image-level labels and object-level bounding boxes for 12 object categories, making it suitable for detection and classification tasks. ...
    Downloads: 6 This Week
    Last Update:
    See Project
  • 8
    Image Harmonization Dataset iHarmony4

    Image Harmonization Dataset iHarmony4

    The first large-scale public benchmark dataset for image harmonization

    This repository provides the iHarmony4 dataset, which is a large-scale dataset designed for image harmonization tasks. Image harmonization involves adjusting the appearance of a foreground in a composite image so that it is consistent with the background (in color, tone, illumination, etc.). The iHarmony4 dataset comprises four sub-datasets (HCOCO, HAdobe5k, HFlickr, Hday2night), each making composite images by combining a foreground from one image with a background from another, along with associated ground truth harmonized images and foreground masks. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 9
    embedchain

    embedchain

    Framework to easily create LLM powered bots over any dataset

    Embedchain is a framework to easily create LLM-powered bots over any dataset. If you want a javascript version, check out embedchain-js. Embedchain empowers you to create chatbot models similar to ChatGPT, using your own evolving dataset. Start building LLM powered bots under 30 seconds.
    Downloads: 1 This Week
    Last Update:
    See Project
  • MongoDB Atlas runs apps anywhere Icon
    MongoDB Atlas runs apps anywhere

    Deploy in 115+ regions with the modern database for every enterprise.

    MongoDB Atlas gives you the freedom to build and run modern applications anywhere—across AWS, Azure, and Google Cloud. With global availability in over 115 regions, Atlas lets you deploy close to your users, meet compliance needs, and scale with confidence across any geography.
    Start Free
  • 10
    IPFS GeoIP

    IPFS GeoIP

    GeoIP lookup over DAG-CBOR dataset loaded from IPFS

    GeoIP lookup over IPFS. GeoIP lookup over DAG-CBOR dataset loaded from IPFS.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 11
    Datasets

    Datasets

    Hub of ready-to-use datasets for ML models

    Datasets is a library for easily accessing and sharing datasets, and evaluation metrics for Natural Language Processing (NLP), computer vision, and audio tasks. Load a dataset in a single line of code, and use our powerful data processing methods to quickly get your dataset ready for training in a deep learning model. Backed by the Apache Arrow format, process large datasets with zero-copy reads without any memory constraints for optimal speed and efficiency. We also feature a deep integration with the Hugging Face Hub, allowing you to easily load and share a dataset with the wider NLP community. ...
    Downloads: 8 This Week
    Last Update:
    See Project
  • 12
    Fluid

    Fluid

    Fluid, elastic data abstraction and acceleration for BigData/AI apps

    Fluid, elastic data abstraction and acceleration for BigData/AI applications in the cloud. Provide DataSet abstraction for underlying heterogeneous data sources with multidimensional management in a cloud environment. Enable dataset warmup and acceleration for data-intensive applications by using a distributed cache in Kubernetes with observability, portability, and scalability. Taking characteristics of application and data into consideration for cloud application/dataset scheduling to improve the performance.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 13
    Parquet.jl

    Parquet.jl

    Julia implementation of Parquet columnar file format reader

    A parquet file or dataset can be loaded using the read_parquet function. A parquet dataset is a directory with multiple parquet files, each of which is a partition belonging to the dataset.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 14
    Datumaro

    Datumaro

    Dataset Management Framework, a Python library and a CLI tool to build

    Datumaro is a flexible Python-based dataset management framework and command-line tool for building, analyzing, transforming, and converting computer vision datasets in many popular formats. It supports importing and exporting annotations and images across a wide variety of standards like COCO, PASCAL VOC, YOLO, ImageNet, Cityscapes, and many more, enabling easy integration with different training pipelines and tools.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 15
    CO3D (Common Objects in 3D)

    CO3D (Common Objects in 3D)

    Tooling for the Common Objects In 3D dataset

    CO3Dv2 (Common Objects in 3D, version 2) is a large-scale 3D computer vision dataset and toolkit from Facebook Research designed for training and evaluating category-level 3D reconstruction methods using real-world data. It builds upon the original CO3Dv1 dataset, expanding both scale and quality—featuring 2× more sequences and 4× more frames, with improved image fidelity, more accurate segmentation masks, and enhanced annotations for object-centric 3D reconstruction. ...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 16
    DOLMA

    DOLMA

    Data and tools for generating and inspecting OLMo pre-training data

    DOLMA (Data Optimization and Learning for Model Alignment) is a framework designed to manage large-scale datasets for training and fine-tuning language models efficiently.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 17
    Open X-Embodiment

    Open X-Embodiment

    Unified open dataset enabling cross-embodiment learning for robotics

    Open X-Embodiment is a large-scale collaborative initiative led by Google DeepMind to unify robotic learning datasets into a consistent and standardized format, simplifying access and usage across the robotics research community. Its primary goal is to make all available open-source robotic data interoperable by representing them using the RLDS (Reinforcement Learning Dataset Structure) episode format. This enables seamless integration for training, evaluation, and model development across diverse robotic tasks and embodiments. The dataset aggregates contributions from multiple open-source robotic projects, all harmonized under a single unified data schema. The repository also provides Colab notebooks for dataset visualization, batching, and model inference, along with pretrained model checkpoints such as RT-1-X, a multitask robotic transformer model trained on this data.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 18
    In-The-Wild Jailbreak Prompts on LLMs

    In-The-Wild Jailbreak Prompts on LLMs

    A dataset consists of 15,140 ChatGPT prompts from Reddit

    ...Researchers analyze these prompts to identify patterns, attack strategies, and techniques commonly used to trick language models into producing restricted or harmful outputs. The dataset includes thousands of prompts collected across multiple platforms and represents one of the largest collections of jailbreak attempts available for research.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 19
    CleanVision

    CleanVision

    Automatically find issues in image datasets

    CleanVision automatically detects potential issues in image datasets like images that are: blurry, under/over-exposed, (near) duplicates, etc. This data-centric AI package is a quick first step for any computer vision project to find problems in the dataset, which you want to address before applying machine learning. CleanVision is super simple -- run the same couple lines of Python code to audit any image dataset! The quality of machine learning models hinges on the quality of the data used to train them, but it is hard to manually identify all of the low-quality data in a big dataset. ...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 20
    CellTypist

    CellTypist

    A tool for semi-automatic cell type classification, harmonization

    CellTypist is an automated tool for cell type classification, harmonization, and integration. Classification, transfer cell type labels from the reference to query dataset. Harmonization, match and harmonize cell types defined by independent datasets. integration, integrate cell and cell types with supervision from harmonization. CellTypist recapitulates cell type structure and biology of independent datasets. Regularised linear models with Stochastic Gradient Descent provide a fast and accurate prediction. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 21
    labelme Image Polygonal Annotation

    labelme Image Polygonal Annotation

    Image polygonal annotation with Python

    ...Image flag annotation for classification and cleaning. Video annotation. (video annotation). GUI customization (predefined labels / flags, auto-saving, label validation, etc). Exporting VOC-format dataset for semantic/instance segmentation. (semantic segmentation, instance segmentation). Exporting COCO-format dataset for instance segmentation. (instance segmentation). The first time you run labelme, it will create a config file in ~/.labelmerc. You can edit this file and the changes will be applied the next time that you launch labelme. ...
    Downloads: 9 This Week
    Last Update:
    See Project
  • 22
    img2dataset

    img2dataset

    Easily turn large sets of image urls to an image dataset

    Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine. Also supports saving captions for url+caption datasets. Opt-out directives: Websites can pass the http headers X-Robots-Tag: noai, X-Robots-Tag: noindex , X-Robots-Tag: noimageai and X-Robots-Tag: noimageindex By default img2dataset will ignore images with such headers.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 23
    NYC Taxi Data

    NYC Taxi Data

    Import public NYC taxi and for-hire vehicle (Uber, Lyft)

    ...It also contains example analyses—spatial and temporal visualizations like maps, time-series plots, and hotspot detection—highlighting insights such as patterns of demand, peak times, and geospatial distributions. The repository is often used as a benchmark dataset and example for teaching, benchmarking, and demonstration purposes in the data science and urban analytics communities.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 24
    Bespoke Curator

    Bespoke Curator

    Synthetic data curation for post-training and data extraction

    ...It supports workflows where models are used to produce synthetic examples that can later be refined into reliable training datasets for reasoning, question answering, or structured information extraction tasks. Curator includes tools for monitoring data generation processes and managing dataset quality while large batches of examples are being created. The framework also integrates with multiple inference systems and APIs, allowing users to generate data using different model providers or open-source inference engines.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 25
    HDF5.jl

    HDF5.jl

    Save and load data in the HDF5 file format from Julia

    HDF5 stands for Hierarchical Data Format v5 and is closely modeled on file systems. In HDF5, a "group" is analogous to a directory, a "dataset" is like a file. HDF5 also uses "attributes" to associate metadata with a particular group or dataset. HDF5 uses ASCII names for these different objects, and objects can be accessed by Unix-like pathnames, e.g., "/sample1/tempsensor/firsttrial" for a top-level group "sample1", a subgroup "tempsensor", and a dataset "firsttrial". ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • Previous
  • You're on page 1
  • 2
  • 3
  • 4
  • 5
  • Next
MongoDB Logo MongoDB