Apache Druid
Apache Druid is an open source distributed data store. Its core design combines ideas from data warehouses, time-series databases, and search systems to create a high-performance real-time analytics database for a broad range of use cases. Druid merges key characteristics of each of these three systems into its ingestion layer, storage format, querying layer, and core architecture. It stores and compresses each column individually and reads only the columns a query needs, which supports fast scans, rankings, and groupBys. It builds inverted indexes on string values for fast search and filtering. It ships with out-of-the-box connectors for Apache Kafka, HDFS, AWS S3, stream processors, and more. Druid partitions data by time, so time-based queries are significantly faster than in traditional databases. Clusters scale up or down by adding or removing servers, and Druid rebalances automatically. The fault-tolerant architecture routes around server failures.
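To make the column-store and inverted-index ideas concrete, here is a minimal Python sketch. It is not Druid code: the column names and data are made up, and plain Python sets stand in for the compressed bitmaps a real system would use.

```python
from collections import defaultdict

def build_inverted_index(column):
    """Map each distinct string value to the set of row ids holding it."""
    index = defaultdict(set)
    for row_id, value in enumerate(column):
        index[value].add(row_id)
    return index

# Hypothetical table, stored one column at a time.
columns = {
    "country": ["US", "DE", "US", "FR", "DE"],
    "device": ["ios", "web", "web", "ios", "web"],
    "clicks": [3, 5, 2, 7, 1],
}

country_idx = build_inverted_index(columns["country"])
device_idx = build_inverted_index(columns["device"])

# Filter country = 'DE' AND device = 'web' as a set intersection of row ids.
matching_rows = sorted(country_idx["DE"] & device_idx["web"])

# Aggregate by reading only the "clicks" column at the matching rows.
total_clicks = sum(columns["clicks"][r] for r in matching_rows)

print(matching_rows, total_clicks)
```

The filter never touches the "clicks" values, and the aggregation never touches the string columns, which is the benefit the description attributes to per-column storage.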
Apache DataFusion
Apache DataFusion is an extensible, high-performance query engine written in Rust that uses Apache Arrow as its in-memory format. Designed for developers building data-centric systems such as databases, dataframe libraries, machine learning pipelines, and streaming applications, DataFusion offers SQL and DataFrame APIs, a vectorized, multi-threaded, streaming execution engine, and support for partitioned data sources. It natively supports formats such as CSV, Parquet, JSON, and Avro, and integrates with object stores including AWS S3, Azure Blob Storage, and Google Cloud Storage. The engine includes a full query planner and an optimizer that performs expression coercion and simplification, projection and filter pushdown, sort- and distribution-aware optimizations, and automatic join reordering. DataFusion is highly customizable: users can add user-defined scalar, aggregate, and window functions, custom data sources, and custom query languages.
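Projection and filter pushdown can be illustrated with a toy example. The sketch below is plain Python rather than DataFusion's Rust API; the table, the `scan` function, and the cell counts are all hypothetical and only show why pushing work into the scan reduces the data that flows through a plan.

```python
# Hypothetical in-memory table; each dict is one row.
TABLE = [
    {"id": 1, "city": "Oslo", "temp": 4, "note": "a"},
    {"id": 2, "city": "Lima", "temp": 19, "note": "b"},
    {"id": 3, "city": "Oslo", "temp": 6, "note": "c"},
]

def scan(table, columns=None, predicate=None):
    """Toy data source that accepts an optional pushed-down filter and projection."""
    out = []
    for row in table:
        if predicate is not None and not predicate(row):
            continue  # row is dropped inside the scan
        keep = columns if columns is not None else list(row)
        out.append({c: row[c] for c in keep})
    return out

def is_oslo(row):
    return row["city"] == "Oslo"

# Naive plan: scan every row and column, then filter, then project.
scanned = scan(TABLE)
naive = [{"temp": r["temp"]} for r in scanned if is_oslo(r)]
naive_cells = sum(len(r) for r in scanned)

# Plan with pushdown: the scan applies the filter and projection itself.
pushed = scan(TABLE, columns=["temp"], predicate=is_oslo)
pushed_cells = sum(len(r) for r in pushed)

print(naive == pushed, naive_cells, pushed_cells)
```

Both plans return the same rows, but the second one materializes far fewer values after the scan.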
BigLake
BigLake is a storage engine that unifies data warehouses and data lakes by letting BigQuery and open-source frameworks such as Spark access data with fine-grained access control. It provides accelerated query performance across multi-cloud storage and open formats such as Apache Iceberg. A single copy of data is stored with uniform features across warehouses and lakes, so analytics can run on distributed data regardless of where and how it is stored, using whichever open-source or cloud-native tools fit best. Fine-grained access control and multi-cloud governance extend across open-source engines such as Apache Spark, Presto, and Trino, and open formats such as Parquet. Queries over data lakes are powered by BigQuery. BigLake integrates with Dataplex to provide management at scale, including logical data organization.
TimescaleDB
TimescaleDB is a time-series database built on PostgreSQL, designed to handle massive volumes of real-time data efficiently. It enables organizations to store, analyze, and query time-series data (such as IoT sensor data, financial transactions, or event logs) using standard SQL. With hypertables, TimescaleDB automatically partitions data by time and, optionally, by an additional key such as an ID, for fast ingestion and predictable query performance. Its compression engine can reduce storage costs by up to 95%, while continuous aggregates keep real-time dashboards responsive. Fully compatible with PostgreSQL, it integrates with existing tools and applications. TimescaleDB combines the simplicity of Postgres with the scalability and speed of a specialized analytical system.
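The idea behind time-based partitioning can be sketched in a few lines of Python. This is not TimescaleDB code: the one-day `CHUNK_INTERVAL`, the helper names, and the sample readings are assumptions chosen for illustration. Rows are routed to a chunk by timestamp, and a range query skips every chunk that cannot overlap the requested window.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

CHUNK_INTERVAL = timedelta(days=1)  # hypothetical chunk width
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def chunk_start(ts):
    """Return the start of the fixed-width time chunk that contains ts."""
    return EPOCH + ((ts - EPOCH) // CHUNK_INTERVAL) * CHUNK_INTERVAL

def insert(chunks, ts, device_id, value):
    """Route a row to its chunk based on the timestamp alone."""
    chunks[chunk_start(ts)].append((ts, device_id, value))

def query_range(chunks, start, end):
    """Return rows with start <= ts < end and the number of chunks scanned."""
    rows, scanned = [], 0
    for c_start, c_rows in chunks.items():
        if c_start + CHUNK_INTERVAL <= start or c_start >= end:
            continue  # chunk lies entirely outside the window, so skip it
        scanned += 1
        rows.extend(r for r in c_rows if start <= r[0] < end)
    return sorted(rows), scanned

def utc(day, hour):
    return datetime(2024, 1, day, hour, tzinfo=timezone.utc)

chunks = defaultdict(list)
insert(chunks, utc(1, 10), "sensor-a", 20.5)
insert(chunks, utc(1, 22), "sensor-b", 21.0)
insert(chunks, utc(2, 3), "sensor-a", 19.8)
insert(chunks, utc(3, 12), "sensor-b", 22.1)

rows, scanned = query_range(chunks, utc(2, 0), utc(3, 0))
print(len(chunks), scanned, rows)
```

Four rows land in three daily chunks, and the one-day query reads only one of them, which is why time-bounded queries stay predictable as the table grows.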