unstructured data free download

Showing 91 open source projects for "unstructured data"

View related business solutions

Try Google Cloud Risk-Free With $300 in Credit
No hidden charges. No surprise bills. Cancel anytime.

Use your credit across every product. Compute, storage, AI, analytics. When it runs out, 20+ products stay free. You only pay when you choose to.

Start Free
Gemini 3 and 200+ AI Models on One Platform
Access Google's best plus Claude, Llama, and Gemma. Fine-tune and deploy from one console.

Build generative AI apps with Vertex AI. Switch between models without switching platforms.

Start Free
1

Unstructured.IO

Open source libraries and APIs to build custom preprocessing pipelines

The unstructured library provides open-source components for ingesting and pre-processing images and text documents, such as PDFs, HTML, Word docs, and many more. The use cases of unstructured revolve around streamlining and optimizing the data processing workflow for LLMs. unstructured modular bricks and connectors form a cohesive system that simplifies data ingestion and pre-processing, making it adaptable to different platforms and is efficient in transforming unstructured data into structured outputs.

Downloads: 0 This Week

Last Update: 2026-02-24
See Project
2

Instill Core

Instill Core is a full-stack AI infrastructure tool for data

Instill Core is an open-source, full-stack AI infrastructure platform designed to orchestrate data pipelines, machine learning models, and unstructured data processing into a unified, production-ready system. It provides an end-to-end solution that enables developers to build, deploy, and manage AI-powered applications without needing to manually stitch together multiple tools across the data and model lifecycle. The platform focuses heavily on handling unstructured data such as documents, images, audio, and video, transforming them into AI-ready formats through integrated ETL pipelines and processing workflows. ...

Downloads: 3 This Week

Last Update: 6 days ago
See Project
3

Superlinked

Superlinked is a Python framework for AI Engineers

Superlinked is a Python framework designed for AI engineers to build high-performance search and recommendation applications that combine structured and unstructured data.

Downloads: 0 This Week

Last Update: 2025-10-22
See Project
4

LOTUS

AI-Powered Data Processing: Use LOTUS to process all of your datasets

LOTUS is an open-source framework and query engine designed to enable efficient processing of structured and unstructured datasets using large language models. The system provides a declarative programming model that allows developers to express complex AI data operations using high-level commands rather than manually orchestrating model calls. It offers a Python interface with a Pandas-like API, making it familiar for data scientists and engineers already working with data analysis libraries. ...

Downloads: 0 This Week

Last Update: 2026-03-06
See Project
Add Two Lines of Code. Get Full APM.
AppSignal installs in minutes and auto-configures dashboards, alerts, and error tracking.

Works out of the box for Rails, Django, Express, Phoenix, and more. Monitoring exceptions and performance in no time.

Start Free
5

DataProfiler

Extract schema, statistics and entities from datasets

DataProfiler is an AI-powered tool for automatic data analysis and profiling, designed to detect patterns, anomalies, and schema inconsistencies in structured and unstructured datasets. The DataProfiler is a Python library designed to make data analysis, monitoring, and sensitive data detection easy. Loading Data with a single command, the library automatically formats & loads files into a DataFrame.

Downloads: 0 This Week

Last Update: 2025-07-30
See Project
6

LlamaIndex

Central interface to connect your LLM's with external data

LlamaIndex (GPT Index) is a project that provides a central interface to connect your LLM's with external data. LlamaIndex is a simple, flexible interface between your external data and LLMs. It provides the following tools in an easy-to-use fashion. Provides indices over your unstructured and structured data for use with LLM's. These indices help to abstract away common boilerplate and pain points for in-context learning. Dealing with prompt limitations (e.g. 4096 tokens for Davinci) when the context is too big. ...

Downloads: 9 This Week

Last Update: 2026-03-16
See Project
7

LlamaParse

Parse files for optimal RAG

LlamaParse is a GenAI-native document parser that can parse complex document data for any downstream LLM use case (RAG, agents). Load in 160+ data sources and data formats, from unstructured, and semi-structured, to structured data (API's, PDFs, documents, SQL, etc.) Store and index your data for different use cases. Integrate with 40+ vector stores, document stores, graph stores, and SQL db providers.

Downloads: 0 This Week

Last Update: 2026-02-13
See Project
8

DeepAnalyze

Autonomous LLM agent for end-to-end data science workflows

DeepAnalyze is an open source project that introduces an agentic large language model designed to perform autonomous data science tasks from start to finish. It is built to handle the entire data science pipeline, including data preparation, analysis, modeling, visualization, and report generation without requiring continuous human guidance. DeepAnalyze is capable of conducting open-ended data research across multiple data formats such as structured tables, semi-structured files, and unstructured text, enabling flexible and comprehensive analysis workflows. ...

Downloads: 3 This Week

Last Update: 53 minutes ago
See Project
9

LiteParse

A fast, helpful, and open-source document parser

LiteParse is an open-source lightweight parsing library designed to extract structured data from unstructured text using large language models in an efficient and cost-effective manner. It focuses on simplifying the process of turning raw text into structured outputs such as JSON by providing a streamlined interface for prompt-based parsing. The system is designed to minimize overhead, making it suitable for applications where performance and cost are critical considerations. ...

Downloads: 4 This Week

Last Update: 2 days ago
See Project
Our Free Plans just got better! | Auth0
With up to 25k MAUs and unlimited Okta connections, our Free Plan lets you focus on what you do best—building great apps.

You asked, we delivered! Auth0 is excited to expand our Free and Paid plans to include more options so you can focus on building, deploying, and scaling applications without having to worry about your security. Auth0 now, thank yourself later.

Try free now
10

OpenViking

Context database designed specifically for AI Agents

OpenViking is an open-source context database engineered for efficient indexing and retrieval of large amounts of unstructured or semi-structured context data used by AI applications. It’s primarily designed to serve as a high-performance, scalable backend for storing app context, embeddings, conversational histories, and other textual artifacts that need rapid lookup and semantic search, which makes it especially useful for systems like chatbots or memory-augmented agents. ...

Downloads: 8 This Week

Last Update: 9 hours ago
See Project
11

MeshLab

The open source mesh processing system

MeshLab is an open-source, portable, and extensible system for the processing and editing of unstructured large 3D triangular meshes. It is aimed to help the processing of the typical not-so-small unstructured models arising in 3D scanning, providing a set of tools for editing, cleaning, healing, inspecting, rendering and converting this kind of meshes. MeshLab is mostly based on the open source C++ mesh processing library VCGlib developed at the Visual Computing Lab of ISTI - CNR. ...

Downloads: 52 This Week

Last Update: 2025-07-22
See Project
12

Milvus

Vector database for scalable similarity search and AI applications

Milvus is an open-source vector database built to power embedding similarity search and AI applications. Milvus makes unstructured data search more accessible, and provides a consistent user experience regardless of the deployment environment. Milvus 2.0 is a cloud-native vector database with storage and computation separated by design. All components in this refactored version of Milvus are stateless to enhance elasticity and flexibility. Average latency measured in milliseconds on trillion vector datasets. ...

Downloads: 0 This Week

Last Update: 2 days ago
See Project
13

Unstract

No-code LLM Platform to launch APIs and ETL Pipelines

Unstract is a powerful open-source, no-code platform built to automate the extraction and structuring of unstructured documents using large language models and flexible workflows, enabling developers and data teams to turn messy files into organized JSON content without complex coding. It integrates a visual Prompt Studio environment where users can iteratively design extraction schemas, compare outputs from different models, and monitor costs and accuracy side by side, making it easier to refine prompts and extraction logic before deploying at scale. ...

Downloads: 2 This Week

Last Update: 17 hours ago
See Project
14

Scanopy

Clean network diagrams, One-time setup, zero upkeep

Scanopy is a powerful multi-modal data capture and analysis toolkit that enables users to collect, process, and visualize structured and unstructured information from a variety of sources in a flexible pipeline. It is built to handle complex scanning tasks — such as OCR, document analysis, audio transcription, network data capture, and image extraction — while providing unified APIs and workflows that make managing heterogeneous data sources seamless. ...

Downloads: 16 This Week

Last Update: 2 days ago
See Project
15

TextFSM

Python module for parsing semi-structured text into python tables

TextFSM is a Python library created by Google that provides a template-based state machine engine for parsing semi-structured text. It is particularly useful for extracting structured data from command-line interface (CLI) outputs, such as those from network devices, routers, and switches. By defining parsing logic through reusable template files, TextFSM transforms unstructured text into structured data like lists or tables without requiring complex regular expression code. Each template defines states, transitions, and regex patterns that determine how to interpret text line by line, enabling precise extraction of key information from varied sources. ...

Downloads: 0 This Week

Last Update: 2025-10-11
See Project
16

RAGFlow

RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine

RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding. It offers a streamlined RAG workflow for businesses of any scale, combining LLM (Large Language Models) to provide truthful question-answering capabilities, backed by well-founded citations from various complex formatted data.

Downloads: 8 This Week

Last Update: 2026-02-10
See Project
17

DocETL

A system for agentic LLM-powered data processing and ETL

DocETL is an open-source system designed to build and execute data processing pipelines powered by large language models, particularly for analyzing complex collections of documents and unstructured datasets. The platform allows developers and researchers to construct structured workflows that extract, transform, and organize information from sources such as reports, transcripts, legal documents, and other text-heavy data.

Downloads: 0 This Week

Last Update: 2026-03-05
See Project
18

DataChain

AI-data warehouse to enrich, transform and analyze unstructured data

Datachain enables multimodal API calls and local AI inferences to run in parallel over many samples as chained operations. The resulting datasets can be saved, versioned, and sent directly to PyTorch and TensorFlow for training. Datachain can persist features of Python objects returned by AI models, and enables vectorized analytical operations over them. The typical use cases are data curation, LLM analytics and validation, image segmentation, pose detection, and GenAI alignment. Datachain...

Downloads: 3 This Week

Last Update: 1 day ago
See Project
19

GraphRAG

A modular graph-based Retrieval-Augmented Generation (RAG) system

The GraphRAG project is a data pipeline and transformation suite that is designed to extract meaningful, structured data from unstructured text using the power of LLMs.

Downloads: 0 This Week

Last Update: 1 day ago
See Project
20

Airweave

Airweave lets agents search any app

Airweave is an open-source platform that enables agents to semantically search across various applications, databases, and APIs. By transforming disparate data sources into a unified, searchable knowledge base, Airweave facilitates intelligent information retrieval through REST APIs or the MCP protocol. It's particularly useful for building AI agents that require access to structured and unstructured data across multiple platforms.

Downloads: 7 This Week

Last Update: 6 hours ago
See Project
21

CrateDB

CrateDB is a distributed and scalable SQL database

CrateDB is a distributed SQL database designed for massive machine data and real-time analytics. It combines the scalability and performance of NoSQL with the power and simplicity of SQL, allowing for horizontal scaling, full-text search, and complex queries over large datasets. Built in Java and powered by Elasticsearch and Lucene, CrateDB is optimized for high-velocity data ingestion and dynamic queries.

Downloads: 0 This Week

Last Update: 2026-03-17
See Project
22

Fluentd

Fluentd: Unified Logging Layer (project under CNCF)

Fluentd is a CNCF‑graduated open-source data collector that unifies log data collection and consumption across diverse systems. It supports robust reliability, buffering, extensible plugin architecture, and real-time log routing. Fluentd serves as a unified logging layer for structured/unstructured data processing.

Downloads: 0 This Week

Last Update: 2026-02-13
See Project
23

skycaiji

Open source web scraping system for automated data collection tasks

SkyCaiji is an open source web scraping and data collection system designed to gather information from websites through configurable extraction rules. It focuses on simplifying the process of building crawlers by allowing users to visually define scraping rules rather than writing complex code. It can collect structured or unstructured data from many types of webpages and automate the extraction process for large datasets.

Downloads: 9 This Week

Last Update: 4 days ago
See Project
24

Thulite

Web framework designed for speed, security, and SEO

Thulite is an AI-powered search and recommendation engine that enhances search functionality in applications. It provides intelligent query processing, result ranking, and personalized recommendations.

Downloads: 0 This Week

Last Update: 2026-02-24
See Project
25

Extractous

Fast and efficient unstructured data extraction

Extractous is a Rust-based unstructured data extraction library focused on fast local parsing of documents and other content-heavy files. Its purpose is to extract text and metadata efficiently from formats such as PDF, Word, HTML, email archives, images, and more, without depending on external APIs or separate parsing servers. The project emphasizes performance and low memory usage, and its maintainers describe it as a local-first alternative to heavier extraction stacks. ...

Downloads: 0 This Week

Last Update: 2026-03-06
See Project