Alternatives to Apache Mahout

Compare Apache Mahout alternatives for your business or organization using the curated list below. SourceForge ranks the best alternatives to Apache Mahout in 2026. Compare features, ratings, user reviews, pricing, and more from Apache Mahout competitors and alternatives in order to make an informed decision for your business.

  • 1
    MLlib

    MLlib

    Apache Software Foundation

    ​Apache Spark's MLlib is a scalable machine learning library that integrates seamlessly with Spark's APIs, supporting Java, Scala, Python, and R. It offers a comprehensive suite of algorithms and utilities, including classification, regression, clustering, collaborative filtering, and tools for constructing machine learning pipelines. MLlib's high-quality algorithms leverage Spark's iterative computation capabilities, delivering performance up to 100 times faster than traditional MapReduce implementations. It is designed to operate across diverse environments, running on Hadoop, Apache Mesos, Kubernetes, standalone clusters, or in the cloud, and accessing various data sources such as HDFS, HBase, and local files. This flexibility makes MLlib a robust solution for scalable and efficient machine learning tasks within the Apache Spark ecosystem. ​
  • 2
    Apache Spark

    Apache Spark

    Apache Software Foundation

    Apache Spark™ is a unified analytics engine for large-scale data processing. Apache Spark achieves high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine. Spark offers over 80 high-level operators that make it easy to build parallel apps. And you can use it interactively from the Scala, Python, R, and SQL shells. Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application. Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources. You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, on Mesos, or on Kubernetes. Access data in HDFS, Alluxio, Apache Cassandra, Apache HBase, Apache Hive, and hundreds of other data sources.
  • 3
    Deeplearning4j

    Deeplearning4j

    Deeplearning4j

    DL4J takes advantage of the latest distributed computing frameworks including Apache Spark and Hadoop to accelerate training. On multi-GPUs, it is equal to Caffe in performance. The libraries are completely open-source, Apache 2.0, and maintained by the developer community and Konduit team. Deeplearning4j is written in Java and is compatible with any JVM language, such as Scala, Clojure, or Kotlin. The underlying computations are written in C, C++, and Cuda. Keras will serve as the Python API. Eclipse Deeplearning4j is the first commercial-grade, open-source, distributed deep-learning library written for Java and Scala. Integrated with Hadoop and Apache Spark, DL4J brings AI to business environments for use on distributed GPUs and CPUs. There are a lot of parameters to adjust when you're training a deep-learning network. We've done our best to explain them, so that Deeplearning4j can serve as a DIY tool for Java, Scala, Clojure, and Kotlin programmers.
  • 4
    E-MapReduce
    EMR is an all-in-one enterprise-ready big data platform that provides cluster, job, and data management services based on open-source ecosystems, such as Hadoop, Spark, Kafka, Flink, and Storm. Alibaba Cloud Elastic MapReduce (EMR) is a big data processing solution that runs on the Alibaba Cloud platform. EMR is built on Alibaba Cloud ECS instances and is based on open-source Apache Hadoop and Apache Spark. EMR allows you to use the Hadoop and Spark ecosystem components, such as Apache Hive, Apache Kafka, Flink, Druid, and TensorFlow, to analyze and process data. You can use EMR to process data stored on different Alibaba Cloud data storage service, such as Object Storage Service (OSS), Log Service (SLS), and Relational Database Service (RDS). You can quickly create clusters without the need to configure hardware and software. All maintenance operations are completed on its Web interface.
  • 5
    Horovod

    Horovod

    Horovod

    Horovod was originally developed by Uber to make distributed deep learning fast and easy to use, bringing model training time down from days and weeks to hours and minutes. With Horovod, an existing training script can be scaled up to run on hundreds of GPUs in just a few lines of Python code. Horovod can be installed on-premise or run out-of-the-box in cloud platforms, including AWS, Azure, and Databricks. Horovod can additionally run on top of Apache Spark, making it possible to unify data processing and model training into a single pipeline. Once Horovod has been configured, the same infrastructure can be used to train models with any framework, making it easy to switch between TensorFlow, PyTorch, MXNet, and future frameworks as machine learning tech stacks continue to evolve.
  • 6
    Amazon EMR
    Amazon EMR is the industry-leading cloud big data platform for processing vast amounts of data using open-source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto. With EMR you can run Petabyte-scale analysis at less than half of the cost of traditional on-premises solutions and over 3x faster than standard Apache Spark. For short-running jobs, you can spin up and spin down clusters and pay per second for the instances used. For long-running workloads, you can create highly available clusters that automatically scale to meet demand. If you have existing on-premises deployments of open-source tools such as Apache Spark and Apache Hive, you can also run EMR clusters on AWS Outposts. Analyze data using open-source ML frameworks such as Apache Spark MLlib, TensorFlow, and Apache MXNet. Connect to Amazon SageMaker Studio for large-scale model training, analysis, and reporting.
  • 7
    Hadoop

    Hadoop

    Apache Software Foundation

    The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures. A wide variety of companies and organizations use Hadoop for both research and production. Users are encouraged to add themselves to the Hadoop PoweredBy wiki page. Apache Hadoop 3.3.4 incorporates a number of significant enhancements over the previous major release line (hadoop-3.2).
  • 8
    Deequ

    Deequ

    Deequ

    Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets. We are happy to receive feedback and contributions. Deequ depends on Java 8. Deequ version 2.x only runs with Spark 3.1, and vice versa. If you rely on a previous Spark version, please use a Deequ 1.x version (legacy version is maintained in legacy-spark-3.0 branch). We provide legacy releases compatible with Apache Spark versions 2.2.x to 3.0.x. The Spark 2.2.x and 2.3.x releases depend on Scala 2.11 and the Spark 2.4.x, 3.0.x, and 3.1.x releases depend on Scala 2.12. Deequ's purpose is to "unit-test" data to find errors early, before the data gets fed to consuming systems or machine learning algorithms. In the following, we will walk you through a toy example to showcase the most basic usage of our library.
  • 9
    Apache Hive

    Apache Hive

    Apache Software Foundation

    The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. A command line tool and JDBC driver are provided to connect users to Hive. Apache Hive is an open source project run by volunteers at the Apache Software Foundation. Previously it was a subproject of Apache® Hadoop®, but has now graduated to become a top-level project of its own. We encourage you to learn about the project and contribute your expertise. Traditional SQL queries must be implemented in the MapReduce Java API to execute SQL applications and queries over distributed data. Hive provides the necessary SQL abstraction to integrate SQL-like queries (HiveQL) into the underlying Java without the need to implement queries in the low-level Java API.
  • 10
    PySpark

    PySpark

    PySpark

    PySpark is an interface for Apache Spark in Python. It not only allows you to write Spark applications using Python APIs, but also provides the PySpark shell for interactively analyzing your data in a distributed environment. PySpark supports most of Spark’s features such as Spark SQL, DataFrame, Streaming, MLlib (Machine Learning) and Spark Core. Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrame and can also act as distributed SQL query engine. Running on top of Spark, the streaming feature in Apache Spark enables powerful interactive and analytical applications across both streaming and historical data, while inheriting Spark’s ease of use and fault tolerance characteristics.
  • 11
    Apache Storm

    Apache Storm

    Apache Software Foundation

    Apache Storm is a free and open source distributed realtime computation system. Apache Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Apache Storm is simple, can be used with any programming language, and is a lot of fun to use! Apache Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Apache Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate. Apache Storm integrates with the queueing and database technologies you already use. An Apache Storm topology consumes streams of data and processes those streams in arbitrarily complex ways, repartitioning the streams between each stage of the computation however needed. Read more in the tutorial.
  • 12
    Apache Phoenix

    Apache Phoenix

    Apache Software Foundation

    Apache Phoenix enables OLTP and operational analytics in Hadoop for low-latency applications by combining the best of both worlds. The power of standard SQL and JDBC APIs with full ACID transaction capabilities and the flexibility of late-bound, schema-on-read capabilities from the NoSQL world by leveraging HBase as its backing store. Apache Phoenix is fully integrated with other Hadoop products such as Spark, Hive, Pig, Flume, and Map Reduce. Become the trusted data platform for OLTP and operational analytics for Hadoop through well-defined, industry-standard APIs. Apache Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. Direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows.
  • 13
    MXNet

    MXNet

    The Apache Software Foundation

    A hybrid front-end seamlessly transitions between Gluon eager imperative mode and symbolic mode to provide both flexibility and speed. Scalable distributed training and performance optimization in research and production is enabled by the dual parameter server and Horovod support. Deep integration into Python and support for Scala, Julia, Clojure, Java, C++, R and Perl. A thriving ecosystem of tools and libraries extends MXNet and enables use-cases in computer vision, NLP, time series and more. Apache MXNet is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision-making process have stabilized in a manner consistent with other successful ASF projects. Join the MXNet scientific community to contribute, learn, and get answers to your questions.
  • 14
    IBM Analytics Engine
    IBM Analytics Engine provides an architecture for Hadoop clusters that decouples the compute and storage tiers. Instead of a permanent cluster formed of dual-purpose nodes, the Analytics Engine allows users to store data in an object storage layer such as IBM Cloud Object Storage and spins up clusters of computing notes when needed. Separating compute from storage helps to transform the flexibility, scalability and maintainability of big data analytics platforms. Build on an ODPi compliant stack with pioneering data science tools with the broader Apache Hadoop and Apache Spark ecosystem. Define clusters based on your application's requirements. Choose the appropriate software pack, version, and size of the cluster. Use as long as required and delete as soon as an application finishes jobs. Configure clusters with third-party analytics libraries and packages. Deploy workloads from IBM Cloud services like machine learning.
    Starting Price: $0.014 per hour
  • 15
    Apache Eagle

    Apache Eagle

    Apache Software Foundation

    Apache Eagle (called Eagle in the following) is an open source analytics solution for identifying security and performance issues instantly on big data platforms, e.g. Apache Hadoop, Apache Spark etc. It analyzes data activities, yarn applications, jmx metrics, and daemon logs etc., provides state-of-the-art alert engine to identify security breach, performance issues and shows insights. Big data platform normally generates huge amount of operational logs and metrics in realtime. Eagle is founded to solve hard problems in securing and tuning performance for big data platforms by ensuring metrics, logs always available and alerting immediately even under huge traffic. Streaming operational logs and data activities into Eagle platform, including but not limited to audit logs, map/reduce jobs, yarn resource usage, jmx metrics and various daemon logs etc. Generate alerts, show historical trend, and correlate alert with raw data.
  • 16
    Apache Kylin

    Apache Kylin

    Apache Software Foundation

    Apache Kylin™ is an open source, distributed Analytical Data Warehouse for Big Data; it was designed to provide OLAP (Online Analytical Processing) capability in the big data era. By renovating the multi-dimensional cube and precalculation technology on Hadoop and Spark, Kylin is able to achieve near constant query speed regardless of the ever-growing data volume. Reducing query latency from minutes to sub-second, Kylin brings online analytics back to big data. Kylin can analyze 10+ billions of rows in less than a second. No more waiting on reports for critical decisions. Kylin connects data on Hadoop to BI tools like Tableau, PowerBI/Excel, MSTR, QlikSense, Hue and SuperSet, making the BI on Hadoop faster than ever. As an Analytical Data Warehouse, Kylin offers ANSI SQL on Hadoop/Spark and supports most ANSI SQL query functions. Kylin can support thousands of interactive queries at the same time, thanks to the low resource consumption of each query.
  • 17
    Apache Trafodion

    Apache Trafodion

    Apache Software Foundation

    Apache Trafodion is a webscale SQL-on-Hadoop solution enabling transactional or operational workloads on Apache Hadoop. Trafodion builds on the scalability, elasticity, and flexibility of Hadoop. Trafodion extends Hadoop to provide guaranteed transactional integrity, enabling new kinds of big data applications to run on Hadoop. Full-functioned ANSI SQL language support. JDBC/ODBC connectivity for Linux/Windows clients. Distributed ACID transaction protection across multiple statements, tables, and rows. Performance improvements for OLTP workloads with compile-time and run-time optimizations. Support for large data sets using a parallel-aware query optimizer. Reuse existing SQL skills and improve developer productivity. Distributed ACID transactions guarantee data consistency across multiple rows and tables. Interoperability with existing tools and applications. Hadoop and Linux distribution neutral. Easy to add to your existing Hadoop infrastructure.
  • 18
    Apache Sentry

    Apache Sentry

    Apache Software Foundation

    Apache Sentry™ is a system for enforcing fine grained role based authorization to data and metadata stored on a Hadoop cluster. Apache Sentry has successfully graduated from the Incubator in March of 2016 and is now a Top-Level Apache project. Apache Sentry is a granular, role-based authorization module for Hadoop. Sentry provides the ability to control and enforce precise levels of privileges on data for authenticated users and applications on a Hadoop cluster. Sentry currently works out of the box with Apache Hive, Hive Metastore/HCatalog, Apache Solr, Impala and HDFS (limited to Hive table data). Sentry is designed to be a pluggable authorization engine for Hadoop components. It allows you to define authorization rules to validate a user or application’s access requests for Hadoop resources. Sentry is highly modular and can support authorization for a wide variety of data models in Hadoop.
  • 19
    Azure Databricks
    Unlock insights from all your data and build artificial intelligence (AI) solutions with Azure Databricks, set up your Apache Spark™ environment in minutes, autoscale, and collaborate on shared projects in an interactive workspace. Azure Databricks supports Python, Scala, R, Java, and SQL, as well as data science frameworks and libraries including TensorFlow, PyTorch, and scikit-learn. Azure Databricks provides the latest versions of Apache Spark and allows you to seamlessly integrate with open source libraries. Spin up clusters and build quickly in a fully managed Apache Spark environment with the global scale and availability of Azure. Clusters are set up, configured, and fine-tuned to ensure reliability and performance without the need for monitoring. Take advantage of autoscaling and auto-termination to improve total cost of ownership (TCO).
  • 20
    Spark Streaming

    Spark Streaming

    Apache Software Foundation

    Spark Streaming brings Apache Spark's language-integrated API to stream processing, letting you write streaming jobs the same way you write batch jobs. It supports Java, Scala and Python. Spark Streaming recovers both lost work and operator state (e.g. sliding windows) out of the box, without any extra code on your part. By running on Spark, Spark Streaming lets you reuse the same code for batch processing, join streams against historical data, or run ad-hoc queries on stream state. Build powerful interactive applications, not just analytics. Spark Streaming is developed as part of Apache Spark. It thus gets tested and updated with each Spark release. You can run Spark Streaming on Spark's standalone cluster mode or other supported cluster resource managers. It also includes a local run mode for development. In production, Spark Streaming uses ZooKeeper and HDFS for high availability.
  • 21
    Azure HDInsight
    Run popular open-source frameworks—including Apache Hadoop, Spark, Hive, Kafka, and more—using Azure HDInsight, a customizable, enterprise-grade service for open-source analytics. Effortlessly process massive amounts of data and get all the benefits of the broad open-source project ecosystem with the global scale of Azure. Easily migrate your big data workloads and processing to the cloud. Open-source projects and clusters are easy to spin up quickly without the need to install hardware or manage infrastructure. Big data clusters reduce costs through autoscaling and pricing tiers that allow you to pay for only what you use. Enterprise-grade security and industry-leading compliance with more than 30 certifications helps protect your data. Optimized components for open-source technologies such as Hadoop and Spark keep you up to date.
  • 22
    Apache Giraph

    Apache Giraph

    Apache Software Foundation

    Apache Giraph is an iterative graph processing system built for high scalability. For example, it is currently used at Facebook to analyze the social graph formed by users and their connections. Giraph originated as the open-source counterpart to Pregel, the graph processing architecture developed at Google and described in a 2010 paper. Both systems are inspired by the Bulk Synchronous Parallel model of distributed computation introduced by Leslie Valiant. Giraph adds several features beyond the basic Pregel model, including master computation, sharded aggregators, edge-oriented input, out-of-core computation, and more. With a steady development cycle and a growing community of users worldwide, Giraph is a natural choice for unleashing the potential of structured datasets at a massive scale. Apache Giraph is an iterative graph processing framework, built on top of Apache Hadoop.
  • 23
    HugeGraph

    HugeGraph

    HugeGraph

    HugeGraph is a fast-speed and highly-scalable graph database. Billions of vertices and edges can be easily stored into and queried from HugeGraph due to its excellent OLTP ability. As compliance to Apache TinkerPop 3 framework, various complicated graph queries can be accomplished through Gremlin (a powerful graph traversal language). Among its features, it provides compliance to Apache TinkerPop 3, supporting Gremlin. Schema Metadata Management, including VertexLabel, EdgeLabel, PropertyKey and IndexLabel. Multi-type Indexes, supporting exact query, range query and complex conditions combination query. Plug-in Backend Store Driver Framework, supporting RocksDB, Cassandra, ScyllaDB, HBase and MySQL now and easy to add other backend store driver if needed. Integration with Hadoop/Spark. HugeGraph relies on the TinkerPop framework, we refer to the storage structure of Titan and the schema definition of DataStax.
  • 24
    IBM Analytics for Apache Spark
    IBM Analytics for Apache Spark is a flexible and integrated Spark service that empowers data science professionals to ask bigger, tougher questions, and deliver business value faster. It’s an easy-to-use, always-on managed service with no long-term commitment or risk, so you can begin exploring right away. Access the power of Apache Spark with no lock-in, backed by IBM’s open-source commitment and decades of enterprise experience. A managed Spark service with Notebooks as a connector means coding and analytics are easier and faster, so you can spend more of your time on delivery and innovation. A managed Apache Spark services gives you easy access to the power of built-in machine learning libraries without the headaches, time and risk associated with managing a Sparkcluster independently.
  • 25
    BigBI

    BigBI

    BigBI

    BigBI enables data specialists to build their own powerful big data pipelines interactively & efficiently, without any coding! BigBI unleashes the power of Apache Spark enabling: Scalable processing of real Big Data (up to 100X faster) Integration of traditional data (SQL, batch files) with modern data sources including semi-structured (JSON, NoSQL DBs, Elastic, Hadoop), and unstructured (Text, Audio, video), Integration of streaming data, cloud data, AI/ML & graphs
  • 26
    Spark NLP

    Spark NLP

    John Snow Labs

    Experience the power of large language models like never before, unleashing the full potential of Natural Language Processing (NLP) with Spark NLP, the open source library that delivers scalable LLMs. The full code base is open under the Apache 2.0 license, including pre-trained models and pipelines. The only NLP library built natively on Apache Spark. The most widely used NLP library in the enterprise. Spark ML provides a set of machine learning applications that can be built using two main components, estimators and transformers. The estimators have a method that secures and trains a piece of data to such an application. The transformer is generally the result of a fitting process and applies changes to the target dataset. These components have been embedded to be applicable to Spark NLP. Pipelines are a mechanism for combining multiple estimators and transformers in a single workflow. They allow multiple chained transformations along a machine-learning task.
  • 27
    IBM Data Refinery
    Available in IBM Watson® Studio and Watson™ Knowledge Catalog, the data refinery tool saves data preparation time by quickly transforming large amounts of raw data into consumable, quality information that’s ready for analytics. Interactively discover, cleanse, and transform your data with over 100 built-in operations. No coding skills are required. Understand the quality and distribution of your data using dozens of built-in charts, graphs, and statistics. Automatically detect data types and business classifications. Access and explore data residing in a wide spectrum of data sources within your organization or the cloud. Automatically enforce policies set by data governance professionals. Schedule data flow executions for repeatable outcomes. Monitor results and receive notifications. Easily scale out via Apache Spark to apply transformation recipes on full data sets. No management of Apache Spark clusters needed.
  • 28
    Apache PredictionIO
    Apache PredictionIO® is an open-source machine learning server built on top of a state-of-the-art open-source stack for developers and data scientists to create predictive engines for any machine learning task. It lets you quickly build and deploy an engine as a web service on production with customizable templates. Respond to dynamic queries in real-time once deployed as a web service, evaluate and tune multiple engine variants systematically, and unify data from multiple platforms in batch or in real-time for comprehensive predictive analytics. Speed up machine learning modeling with systematic processes and pre-built evaluation measures, support machine learning and data processing libraries such as Spark MLLib and OpenNLP. Implement your own machine learning models and seamlessly incorporate them into your engine. Simplify data infrastructure management. Apache PredictionIO® can be installed as a full machine learning stack, bundled with Apache Spark, MLlib, HBase, Akka HTTP, etc.
  • 29
    IOMETE

    IOMETE

    IOMETE

    IOMETE is a self-hosted data lakehouse platform built on Apache Iceberg, Apache Spark, and Kubernetes. Run it on-premises or in your private cloud — your infrastructure, your data, your control. Built for enterprises in regulated industries, IOMETE eliminates third-party ICT risk at the data layer by architecture — not by contract. No SaaS dependencies. No data leaving your perimeter. Compliance with GDPR, DORA, and NIS2 is structural, not contractual. Included in one platform: - Data Lakehouse(s) - Data Catalog - SQL Editor - Apache Spark Jobs - ML Notebooks - Orchestration Engine - Spark Connect Key capabilities: Apache Iceberg-native storage, Kubernetes-native deployment (K8s + OpenShift), row/column/tag-based access control, Data Mesh support, air-gapped and zero-trust compatible. Transparent pricing — CPU-based, no per-query fees, no billing surprises.
  • 30
    Apache Ranger

    Apache Ranger

    The Apache Software Foundation

    Apache Ranger™ is a framework to enable, monitor and manage comprehensive data security across the Hadoop platform. The vision with Ranger is to provide comprehensive security across the Apache Hadoop ecosystem. With the advent of Apache YARN, the Hadoop platform can now support a true data lake architecture. Enterprises can potentially run multiple workloads, in a multi tenant environment. Data security within Hadoop needs to evolve to support multiple use cases for data access, while also providing a framework for central administration of security policies and monitoring of user access. Centralized security administration to manage all security related tasks in a central UI or using REST APIs. Fine grained authorization to do a specific action and/or operation with Hadoop component/tool and managed through a central administration tool. Standardize authorization method across all Hadoop components. Enhanced support for different authorization methods - Role based access control etc.
  • 31
    Apache Bigtop

    Apache Bigtop

    Apache Software Foundation

    Bigtop is an Apache Foundation project for Infrastructure Engineers and Data Scientists looking for comprehensive packaging, testing, and configuration of the leading open source big data components. Bigtop supports a wide range of components/projects, including, but not limited to, Hadoop, HBase and Spark. Bigtop packages Hadoop RPMs and DEBs, so that you can manage and maintain your Hadoop cluster. Bigtop provides an integrated smoke testing framework, alongside a suite of over 50 test files. Bigtop provides vagrant recipes, raw images, and (work-in-progress) docker recipes for deploying Hadoop from zero. Bigtop support many Operating Systems, including Debian, Ubuntu, CentOS, Fedora, openSUSE and many others. Bigtop includes tools and a framework for testing at various levels (packaging, platform, runtime, etc.) for both initial deployments as well as upgrade scenarios for the entire data platform, not just the individual components.
  • 32
    BigLake

    BigLake

    Google

    BigLake is a storage engine that unifies data warehouses and lakes by enabling BigQuery and open-source frameworks like Spark to access data with fine-grained access control. BigLake provides accelerated query performance across multi-cloud storage and open formats such as Apache Iceberg. Store a single copy of data with uniform features across data warehouses & lakes. Fine-grained access control and multi-cloud governance over distributed data. Seamless integration with open-source analytics tools and open data formats. Unlock analytics on distributed data regardless of where and how it’s stored, while choosing the best analytics tools, open source or cloud-native over a single copy of data. Fine-grained access control across open source engines like Apache Spark, Presto, and Trino, and open formats such as Parquet. Performant queries over data lakes powered by BigQuery. Integrates with Dataplex to provide management at scale, including logical data organization.
    Starting Price: $5 per TB
  • 33
    JanusGraph

    JanusGraph

    JanusGraph

    JanusGraph is a scalable graph database optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster. JanusGraph is a project under The Linux Foundation, and includes participants from Expero, Google, GRAKN.AI, Hortonworks, IBM and Amazon. Elastic and linear scalability for a growing data and user base. Data distribution and replication for performance and fault tolerance. Multi-datacenter high availability and hot backups. All functionality is totally free. No need to buy commercial licenses. JanusGraph is fully open source under the Apache 2 license. JanusGraph is a transactional database that can support thousands of concurrent users executing complex graph traversals in real time. Support for ACID and eventual consistency. In addition to online transactional processing (OLTP), JanusGraph supports global graph analytics (OLAP) with its Apache Spark integration.
  • 34
    Apache Knox

    Apache Knox

    Apache Software Foundation

    The Knox API Gateway is designed as a reverse proxy with consideration for pluggability in the areas of policy enforcement, through providers and the backend services for which it proxies requests. Policy enforcement ranges from authentication/federation, authorization, audit, dispatch, hostmapping and content rewrite rules. Policy is enforced through a chain of providers that are defined within the topology deployment descriptor for each Apache Hadoop cluster gated by Knox. The cluster definition is also defined within the topology deployment descriptor and provides the Knox Gateway with the layout of the cluster for purposes of routing and translation between user facing URLs and cluster internals. Each Apache Hadoop cluster that is protected by Knox has its set of REST APIs represented by a single cluster specific application context path. This allows the Knox Gateway to both protect multiple clusters and present the REST API consumer with a single endpoint.
  • 35
    Apache Flink

    Apache Flink

    Apache Software Foundation

    Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments, perform computations at in-memory speed and at any scale. Any kind of data is produced as a stream of events. Credit card transactions, sensor measurements, machine logs, or user interactions on a website or mobile application, all of these data are generated as a stream. Apache Flink excels at processing unbounded and bounded data sets. Precise control of time and state enable Flink’s runtime to run any kind of application on unbounded streams. Bounded streams are internally processed by algorithms and data structures that are specifically designed for fixed sized data sets, yielding excellent performance. Flink is designed to work well each of the previously listed resource managers.
  • 36
    ZetaAnalytics

    ZetaAnalytics

    Halliburton

    The ZetaAnalytics product requires a compatible database appliance for its Data Warehouse. Landmark has qualified the ZetaAnalytics software using Teradata, EMC Greenplum, and IBM Netezza. Please see the ZetaAnalytics Release Notes for the most up to date qualified versions. Before installing and configuring ZetaAnalytics software, ensure that the Data Warehouse you use for drilling data is created and running. Scripts to create the various Zeta-specific database components within the Data Warehouse will need to be run as part of the installation process. These require database administrator (DBA) rights. The ZetaAnalytics product requires Apache Hadoop for model scoring and real-time streaming. If you do not already have an Apache Hadoop cluster installed in your environment, please install it before running the ZetaAnalytics installer, which will prompt you for the name and port number of your Hadoop Name Server and Map Reducer.
  • 37
    Apache Tomcat
    The Apache Tomcat® software is an open source implementation of the Jakarta Servlet, Jakarta Server Pages, Jakarta Expression Language, Jakarta WebSocket, Jakarta Annotations and Jakarta Authentication specifications. These specifications are part of the Jakarta EE platform. Apache Tomcat software powers numerous large-scale, mission-critical web applications across a diverse range of industries and organizations. Some of these users and their stories are listed on the PoweredBy wiki page. The Apache Tomcat Project is proud to announce the release of version 10.0.10 of Apache Tomcat. This release implements specifications that are part of the Jakarta EE 9 platform.
  • 38
    Yandex Data Proc
    You select the size of the cluster, node capacity, and a set of services, and Yandex Data Proc automatically creates and configures Spark and Hadoop clusters and other components. Collaborate by using Zeppelin notebooks and other web apps via a UI proxy. You get full control of your cluster with root permissions for each VM. Install your own applications and libraries on running clusters without having to restart them. Yandex Data Proc uses instance groups to automatically increase or decrease computing resources of compute subclusters based on CPU usage indicators. Data Proc allows you to create managed Hive clusters, which can reduce the probability of failures and losses caused by metadata unavailability. Save time on building ETL pipelines and pipelines for training and developing models, as well as describing other iterative tasks. The Data Proc operator is already built into Apache Airflow.
    Starting Price: $0.19 per hour
  • 39
    Apache Accumulo

    Apache Accumulo

    Apache Corporation

    With Apache Accumulo, users can store and manage large data sets across a cluster. Accumulo uses Apache Hadoop's HDFS to store its data and Apache ZooKeeper for consensus. While many users interact directly with Accumulo, several open source projects use Accumulo as their underlying store. To learn more about Accumulo, take the Accumulo tour, read the user manual and run the Accumulo example code. Feel free to contact us if you have any questions. Accumulo has a programming mechanism (called Iterators) that can modify key/value pairs at various points in the data management process. Every Accumulo key/value pair has its own security label which limits query results based off user authorizations. Accumulo runs on a cluster using one or more HDFS instances. Nodes can be added or removed as the amount of data stored in Accumulo changes.
  • 40
    Delta Lake

    Delta Lake

    Delta Lake

    Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark™ and big data workloads. Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity, due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest level of isolation level. Learn more at Diving into Delta Lake: Unpacking the Transaction Log. In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files at ease. Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • 41
    Apache Lucene

    Apache Lucene

    Apache Software Foundation

    The Apache Lucene™ project develops open-source search software. The project releases a core search library, named Lucene™ core, as well as PyLucene, a python binding for Lucene. Lucene Core is a Java library providing powerful indexing and search features, as well as spellchecking, hit highlighting and advanced analysis/tokenization capabilities. The PyLucene sub project provides Python bindings for Lucene Core. The Apache Software Foundation provides support for the Apache community of open-source software projects. Apache Lucene is distributed under a commercially friendly Apache Software license. Apache Lucene set the standard for search and indexing performance. Lucene is the search core of both Apache Solr™ and Elasticsearch™. Our core algorithms along with the Solr search server power applications the world over, ranging from mobile devices to sites like Twitter, Apple and Wikipedia. The goal of Apache Lucene is to provide world class search capabilities.
  • 42
    Stackable

    Stackable

    Stackable

    The Stackable data platform was designed with openness and flexibility in mind. It provides you with a curated selection of the best open source data apps like Apache Kafka, Apache Druid, Trino, and Apache Spark. While other current offerings either push their proprietary solutions or deepen vendor lock-in, Stackable takes a different approach. All data apps work together seamlessly and can be added or removed in no time. Based on Kubernetes, it runs everywhere, on-prem or in the cloud. stackablectl and a Kubernetes cluster are all you need to run your first stackable data platform. Within minutes, you will be ready to start working with your data. Configure your one-line startup command right here. Similar to kubectl, stackablectl is designed to easily interface with the Stackable Data Platform. Use the command line utility to deploy and manage stackable data apps on Kubernetes. With stackablectl, you can create, delete, and update components.
  • 43
    Oracle Cloud Infrastructure Data Flow
    Oracle Cloud Infrastructure (OCI) Data Flow is a fully managed Apache Spark service to perform processing tasks on extremely large data sets without infrastructure to deploy or manage. This enables rapid application delivery because developers can focus on app development, not infrastructure management. OCI Data Flow handles infrastructure provisioning, network setup, and teardown when Spark jobs are complete. Storage and security are also managed, which means less work is required for creating and managing Spark applications for big data analysis. With OCI Data Flow, there are no clusters to install, patch, or upgrade, which saves time and operational costs for projects. OCI Data Flow runs each Spark job in private dedicated resources, eliminating the need for upfront capacity planning. With OCI Data Flow, IT only needs to pay for the infrastructure resources that Spark jobs use while they are running.
    Starting Price: $0.0085 per GB per hour
  • 44
    Flyte

    Flyte

    Union.ai

    The workflow automation platform for complex, mission-critical data and ML processes at scale. Flyte makes it easy to create concurrent, scalable, and maintainable workflows for machine learning and data processing. Flyte is used in production at Lyft, Spotify, Freenome, and others. At Lyft, Flyte has been serving production model training and data processing for over four years, becoming the de-facto platform for teams like pricing, locations, ETA, mapping, autonomous, and more. In fact, Flyte manages over 10,000 unique workflows at Lyft, totaling over 1,000,000 executions every month, 20 million tasks, and 40 million containers. Flyte has been battle-tested at Lyft, Spotify, Freenome, and others. It is entirely open-source with an Apache 2.0 license under the Linux Foundation with a cross-industry overseeing committee. Configuring machine learning and data workflows can get complex and error-prone with YAML.
  • 45
    Amazon MSK
    Amazon Managed Streaming for Apache Kafka (Amazon MSK) is a fully managed service that makes it easy for you to build and run applications that use Apache Kafka to process streaming data. Apache Kafka is an open-source platform for building real-time streaming data pipelines and applications. With Amazon MSK, you can use native Apache Kafka APIs to populate data lakes, stream changes to and from databases, and power machine learning and analytics applications. Apache Kafka clusters are challenging to setup, scale, and manage in production. When you run Apache Kafka on your own, you need to provision servers, configure Apache Kafka manually, replace servers when they fail, orchestrate server patches and upgrades, architect the cluster for high availability, ensure data is durably stored and secured, setup monitoring and alarms, and carefully plan scaling events to support load changes.
    Starting Price: $0.0543 per hour
  • 46
    Apache Pulsar

    Apache Pulsar

    Apache Software Foundation

    Apache Pulsar is a cloud-native, distributed messaging and streaming platform originally created at Yahoo! and now a top-level Apache Software Foundation project. Easy to deploy, lightweight compute process, developer-friendly APIs, no need to run your own stream processing engine. Run in production at Yahoo! scale for over 5 years, with millions of messages per second across millions of topics. Built from the ground up as a multi-tenant system. Supports isolation, authentication, authorization and quotas. Configurable replication between data centers across multiple geographic regions. Persistent message storage based on Apache BookKeeper. IO-level isolation between write and read operations. Rest admin API for provisioning, administration, tools and monitoring.
  • 47
    JAX

    JAX

    JAX

    ​JAX is a Python library designed for high-performance numerical computing and machine learning research. It offers a NumPy-like API, facilitating seamless adoption for those familiar with NumPy. Key features of JAX include automatic differentiation, just-in-time compilation, vectorization, and parallelization, all optimized for execution on CPUs, GPUs, and TPUs. These capabilities enable efficient computation for complex mathematical functions and large-scale machine-learning models. JAX also integrates with various libraries within its ecosystem, such as Flax for neural networks and Optax for optimization tasks. Comprehensive documentation, including tutorials and user guides, is available to assist users in leveraging JAX's full potential. ​
  • 48
    SelectDB

    SelectDB

    SelectDB

    SelectDB is a modern data warehouse based on Apache Doris, which supports rapid query analysis on large-scale real-time data. From Clickhouse to Apache Doris, to achieve the separation of the lake warehouse and upgrade to the lake warehouse. The fast-hand OLAP system carries nearly 1 billion query requests every day to provide data services for multiple scenes. Due to the problems of storage redundancy, resource seizure, complicated governance, and difficulty in querying and adjustment, the original lake warehouse separation architecture was decided to introduce Apache Doris lake warehouse, combined with Doris's materialized view rewriting ability and automated services, to achieve high-performance data query and flexible data governance. Write real-time data in seconds, and synchronize flow data from databases and data streams. Data storage engine for real-time update, real-time addition, and real-time pre-polymerization.
    Starting Price: $0.22 per hour
  • 49
    Huawei Cloud ModelArts
    ​ModelArts is a comprehensive AI development platform provided by Huawei Cloud, designed to streamline the entire AI workflow for developers and data scientists. It offers a full-lifecycle toolchain that includes data preprocessing, semi-automated data labeling, distributed training, automated model building, and flexible deployment options across cloud, edge, and on-premises environments. It supports popular open source AI frameworks such as TensorFlow, PyTorch, and MindSpore, and allows for the integration of custom algorithms tailored to specific needs. ModelArts features an end-to-end development pipeline that enhances collaboration across DataOps, MLOps, and DevOps, boosting development efficiency by up to 50%. It provides cost-effective AI computing resources with diverse specifications, enabling large-scale distributed training and inference acceleration.
  • 50
    AWS Deep Learning AMIs
    AWS Deep Learning AMIs (DLAMI) provides ML practitioners and researchers with a curated and secure set of frameworks, dependencies, and tools to accelerate deep learning in the cloud. Built for Amazon Linux and Ubuntu, Amazon Machine Images (AMIs) come preconfigured with TensorFlow, PyTorch, Apache MXNet, Chainer, Microsoft Cognitive Toolkit (CNTK), Gluon, Horovod, and Keras, allowing you to quickly deploy and run these frameworks and tools at scale. Develop advanced ML models at scale to develop autonomous vehicle (AV) technology safely by validating models with millions of supported virtual tests. Accelerate the installation and configuration of AWS instances, and speed up experimentation and evaluation with up-to-date frameworks and libraries, including Hugging Face Transformers. Use advanced analytics, ML, and deep learning capabilities to identify trends and make predictions from raw, disparate health data.