Alternatives to Hume AI

Compare Hume AI alternatives for your business or organization using the curated list below. SourceForge ranks the best alternatives to Hume AI in 2026. Compare features, ratings, user reviews, pricing, and more from Hume AI competitors and alternatives in order to make an informed decision for your business.

  • 1
    Speechmatics

    Speechmatics

    Best-in-Market Speech-to-Text & Voice AI for Enterprises. Speechmatics delivers industry-leading Speech-to-Text and Voice AI for enterprises needing unrivaled accuracy, security, and flexibility. Our enterprise-grade APIs provide real-time and batch transcription with exceptional precision—across the widest range of languages, dialects, and accents. Powered by Foundational Speech Technology, Speechmatics supports mission-critical voice applications in media, contact centers, finance, healthcare, and more. With on-prem, cloud, and hybrid deployment, businesses maintain full control over data security while unlocking voice insights. Trusted by global leaders, Speechmatics is the top choice for best-in-class transcription and voice intelligence. 🔹 Unmatched Accuracy – Superior transcription across languages & accents 🔹 Flexible Deployment – Cloud, on-prem, and hybrid 🔹 Enterprise-Grade Security – Full data control 🔹 Real-Time & Batch Processing – Scalable transcription
    Starting Price: $0 per month
  • 2
    CallFinder

    CallFinder

    CallFinder speech analytics software automates outdated, manual QA processes to save time and provide immediate insights so you can make data-driven decisions. CallFinder automatically transcribes and scores recorded calls, identifying key metrics you can use to improve every aspect of your business. We deliver a highly scalable Software as a Service (SaaS) solution to contact centers and small and medium sized businesses across a wide range of industries. We like to think of ourselves as the speech analytics experts because, well, that’s all we do. We’re all about delivering a truly different software experience. You never get something that’s out-of-the-box. On purpose. Our Managed Client Services support is a differentiator that none of our speech analytics competitors offer. Your CallFinder Analyst becomes an integral part of your QA team, and you will work with your Analyst on a recurring basis to optimize CallFinder to meet your evolving business needs.
  • 3
    Google Cloud Natural Language API
    Get insightful text analysis with machine learning that extracts, analyzes, and stores text. Train high-quality machine learning custom models without a single line of code with AutoML. Apply natural language understanding (NLU) to apps with Natural Language API. Use entity analysis to find and label fields within a document, including emails, chat, and social media, and then sentiment analysis to understand customer opinions to find actionable product and UX insights. Natural Language with speech-to-text API extracts insights from audio. Vision API adds optical character recognition (OCR) for scanned docs. Translation API understands sentiments in multiple languages. Use custom entity extraction to identify domain-specific entities within documents, many of which don’t appear in standard language models, without having to spend time or money on manual analysis. Train your own high-quality machine learning custom models to classify, extract, and detect sentiment.
  • 4
    Play.ht

    Play.ht

    AI Powered Text to Voice Generation. Play.ht offers uncanny, high-fidelity AI Voices for any project where you need human-sounding voice overs and performances. Hollywood studios, auto manufacturers, and other large enterprises use Play.ht to create realistic and engaging voiceovers quickly, without the hassle of scheduling and hiring voice talent. Our voices sound natural, expressive, and engaging, just like human voice talent. Play.ht offers API access as well as an online rich-text editor that allows you to generate entire performances with multiple speakers, edit their pacing, and generate unique versions of each paragraph - all within seconds. Join other companies looking to scale up and simplify their voice work by scheduling a live demo today.
    Starting Price: $199 per month
  • 5
    Amazon Rekognition
    Amazon Rekognition makes it easy to add image and video analysis to your applications using proven, highly scalable, deep learning technology that requires no machine learning expertise to use. With Amazon Rekognition, you can identify objects, people, text, scenes, and activities in images and videos, as well as detect any inappropriate content. Amazon Rekognition also provides highly accurate facial analysis and facial search capabilities that you can use to detect, analyze, and compare faces for a wide variety of user verification, people counting, and public safety use cases. With Amazon Rekognition Custom Labels, you can identify the objects and scenes in images that are specific to your business needs. For example, you can build a model to classify specific machine parts on your assembly line or to detect unhealthy plants. Amazon Rekognition Custom Labels takes care of the heavy lifting of model development for you, so no machine learning experience is required.
  • 6
    Amazon Polly
    Amazon Polly is a service that turns text into lifelike speech, allowing you to create applications that talk, and build entirely new categories of speech-enabled products. Polly's Text-to-Speech (TTS) service uses advanced deep learning technologies to synthesize natural sounding human speech. With dozens of lifelike voices across a broad set of languages, you can build speech-enabled applications that work in many different countries. In addition to Standard TTS voices, Amazon Polly offers Neural Text-to-Speech (NTTS) voices that deliver advanced improvements in speech quality through a new machine learning approach. Polly’s Neural TTS technology also supports two speaking styles that allow you to better match the delivery style of the speaker to the application: a Newscaster reading style that is tailored to news narration use cases, and a Conversational speaking style that is ideal for two-way communication like telephony applications.
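As a rough illustration of the two speaking styles described above, the Python sketch below builds the parameters one might pass to boto3's Polly `synthesize_speech` call. It assumes the styles are selected with the SSML `<amazon:domain>` tag and that the `Joanna` neural voice supports them. The helper name `build_polly_request` is hypothetical, and no network call is made.

```python
from xml.sax.saxutils import escape

# Mapping from the style names used in the description to the values
# assumed for Polly's SSML <amazon:domain name="..."> attribute.
STYLE_DOMAINS = {"newscaster": "news", "conversational": "conversational"}


def build_polly_request(text, style, voice_id="Joanna"):
    """Return keyword arguments for a Polly synthesize_speech call.

    Hypothetical helper: it only assembles the request and sends nothing.
    """
    domain = STYLE_DOMAINS[style]
    # Escape &, < and > so the user text cannot break the SSML document.
    ssml = (
        "<speak>"
        f'<amazon:domain name="{domain}">{escape(text)}</amazon:domain>'
        "</speak>"
    )
    return {
        "Engine": "neural",       # speaking styles are a Neural TTS feature
        "OutputFormat": "mp3",
        "Text": ssml,
        "TextType": "ssml",
        "VoiceId": voice_id,      # assumption: this voice supports the style
    }


request = build_polly_request("Markets closed higher today.", "newscaster")
print(request["Text"])
```

With AWS credentials configured, the resulting dictionary could be passed as `boto3.client("polly").synthesize_speech(**request)`. Which voices and regions support each style should be checked against the current Polly documentation.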
  • 7
    Komprehend

    Komprehend

    Komprehend AI APIs are the most comprehensive set of document classification and NLP APIs for software developers. Our NLP models are trained on more than a billion documents and provide state-of-the-art accuracy on most common NLP use cases such as sentiment analysis and emotion detection. Try our free demo now and see the effectiveness of our Text Analysis API. Maintains high accuracy in the real world, and brings out useful insights from open-ended textual data. Works on a variety of data, ranging from finance to healthcare. Supports private cloud deployments via Docker containers or on-premise deployment ensuring no data leakage. Protects your data and follows the GDPR compliance guidelines to the last word. Understand the social sentiment of your brand, product, or service while monitoring online conversations. Sentiment analysis is contextual mining of text which identifies and extracts subjective information in the source material.
    Starting Price: $79 per month
  • 8
    PolygrAI

    PolygrAI

    PolygrAI is an innovative platform that provides real-time insights into emotional states and potential deception. Performing a polygraph examination has never been easier with our desktop application: just click start, choose your video feed source, and see the insights. Our interface allows you to see through words and gain insights into the subconscious. The most important and comprehensive metric is simplified for your convenience, helping you understand the overall sentiment throughout the entire examination. Detected emotions are categorized by priority as primary, secondary, and tertiary. When you choose a subject, all others shown in the video feed are ignored. Our desktop application is packed with other features designed to help you perform better and easier assessments. You can choose the default screen capture, which lets you use it with any other application, or connect a USB camera.
    Starting Price: $28/month
  • 9
    Dandelion API

    SpazioDati

    Find mentions of places, people, brands and events in documents and social media. Easily get additional data about the entities. Classify multilingual text into standard, pre-defined taxonomies or build your own custom classification scheme in minutes. Identify whether the expressed opinion in short texts (like product reviews) is positive, negative, or neutral. Automatically identify important, contextually relevant, concepts and key-phrases in articles and social media posts. Compare two texts and compute their syntactic and semantic similarity. Understand when two texts are about the same subject. Extract clean text article from newspapers, blogs and other websites. Remove boilerplate and advertising and get the article full text and images.
    Starting Price: $49 per month
  • 10
    Element Human

    Element Human

    Replace clunky ad testing with real world engagement. Attention and Emotions at the speed and scale of a click. We provide the science, the tools, and the platform to quickly set up, measure and respond to human behaviours at scale, cost-effectively. We believe that the more we understand the subconscious and conscious drivers of human behaviour, the better our predictions, decisions, and interactions will be. We are a group of science, technology and design experts obsessed with enabling everyday devices to observe and measure how people live their lives. Our consent-based platform enables everyday devices to safely capture and respond to the emotional, memory and thought drivers of human behaviour as people interact with digital experiences. Through 7 years and 2.5 billion data points collected across 89 countries and 40 businesses, we developed a proprietary solution that monitors and understands how our digital experiences shape human behaviours.
    Starting Price: $2,014.10 per user
  • 11
    FaceReader
    To gain accurate and reliable data about facial expressions, FaceReader is the most robust automated system that will help you out. Clear insights into the effect of different stimuli on emotions. Very easy-to-use, save valuable time and resources. Easy integration with eye tracking data and physiology data. Many researchers have turned towards using automated facial expression analysis software to better provide an objective assessment of emotions. FaceReader software is fast, flexible, objective, accurate, and easy to use. It immediately analyzes your data (live, video, or still images), saving valuable time. The option to record audio as well as video makes it possible to hear what people have been saying, for example, during human-computer interactions, or while watching stimuli. FaceReader is the most robust automated system for the recognition of a number of specific properties in facial images, including the six basic or universal expressions.
  • 12
    Gemini 2.5 Pro TTS
    Gemini 2.5 Pro TTS is Google’s advanced text-to-speech model in the Gemini 2.5 family, optimized for high-quality, expressive, controllable speech synthesis for structured and professional audio generation tasks. The model delivers natural-sounding voice output with enhanced expressivity, tone control, pacing, and pronunciation fidelity, enabling developers to dictate style, accent, rhythm, and emotional nuance through text-based prompts, making it suitable for applications like podcasts, audiobooks, customer assistance, tutorials, and multimedia narration that require premium audio output. It supports both single-speaker and multi-speaker audio, allowing distinct voices and conversational flows in the same output, and can synthesize speech across multiple languages with consistent style adherence. Compared with lower-latency variants like Flash TTS, the Pro TTS model prioritizes sound quality, depth of expression, and nuanced control.
  • 13
    OpenAI Realtime API
    The OpenAI Realtime API is a newly introduced API, announced in 2024, that allows developers to create applications that facilitate real-time, low-latency interactions, such as speech-to-speech conversations. This API is designed for use cases like customer support agents, AI voice assistants, and language learning apps. Unlike previous implementations that required multiple models for speech recognition and text-to-speech conversion, the Realtime API handles these processes seamlessly in one call, enabling applications to handle voice interactions much faster and with more natural flow.
  • 14
    Octave TTS

    Hume AI

    Hume AI has introduced Octave (Omni-capable Text and Voice Engine), a groundbreaking text-to-speech system that leverages large language model technology to understand and interpret the context of words, enabling it to generate speech with appropriate emotions, rhythm, and cadence. Unlike traditional TTS models that merely read text, Octave acts like a human actor, delivering lines with nuanced expression based on the content. Users can create diverse AI voices by providing descriptive prompts, such as "a sarcastic medieval peasant," allowing for tailored voice generation that aligns with specific character traits or scenarios. Additionally, Octave offers the flexibility to modify the emotional delivery and speaking style through natural language instructions, enabling commands like "sound more enthusiastic" or "whisper fearfully" to fine-tune the output.
    Starting Price: $3 per month
  • 15
    Azure Face API
    Embed facial recognition into your apps for a seamless and highly secured user experience. No machine learning expertise is required. Features include: face detection that perceives faces and attributes in an image; person identification that matches an individual in your private repository of up to 1 million people; perceived emotion recognition that detects a range of facial expressions like happiness, contempt, neutrality, and fear; and recognition and grouping of similar faces in images. Recognize faces according to diverse attributes. Add facial recognition to your apps, all through a single API call. Run Face in the cloud or on the edge in containers. Rely on enterprise-grade security and privacy applied to both your data and any trained models. Detect, identify, and analyze faces in images and videos. Build on top of this technology to support various scenarios. Detect one or more human faces along with attributes.
    Starting Price: $0.01 per month
  • 16
    SoundHound

    SoundHound AI

    We believe every brand should have a voice and every person should be able to interact naturally with the products around them, by simply talking. At SoundHound Inc., we’re working together with our strategic partners to build a more accessible and connected world. We build custom voice assistants for companies wanting to keep their brand, users, and data. Built on the foundation of proprietary Speech-to-Meaning® and Deep Meaning Understanding® technologies, the Houndify platform provides conversational intelligence unmatched by others in the industry. Houndify everything! Voice-enable the world with conversational intelligence. Create a voice AI platform that exceeds human capabilities and brings value and delight via an ecosystem of billions of products enhanced by innovation and monetization opportunities. Headquartered in the heart of Silicon Valley, we are a global company with 9 offices in key markets and teams in 16 countries.
  • 17
    MorphCast
    MorphCast Emotion AI Interactive Video Platform is the most flexible, easy-to-use, and fast solution to let creatives design highly engaging interactive videos in minutes. In addition to the most up-to-date interaction options, the video content can be triggered by the viewer’s facial expressions while watching it, thanks to our Facial Emotion AI integrated in the platform. MorphCast is a dynamic tool created for professionals. You can download it for free from the Microsoft Store and the Mac App Store. You pay only for the minutes of views of your videos, and the first 2,000 minutes per month are always free. MorphCast also offers an analytics dashboard to evaluate the performance of your interactive videos. You can measure how your content performs and adjust your audience experience according to their interaction and emotional reaction.
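The free monthly allowance described above amounts to a simple threshold calculation, sketched here in Python. The function name `billable_minutes` is hypothetical, and the listing gives no per-minute rate, so the sketch stops at the number of minutes that would be charged.

```python
# Free viewing allowance per month, as stated in the description above.
FREE_MINUTES_PER_MONTH = 2000


def billable_minutes(viewed_minutes):
    """Return the minutes of views charged in one month (hypothetical helper)."""
    if viewed_minutes < 0:
        raise ValueError("viewed_minutes must be non-negative")
    # Only usage beyond the free allowance is billable.
    return max(0, viewed_minutes - FREE_MINUTES_PER_MONTH)


for minutes in (1500, 2000, 2750):
    print(minutes, billable_minutes(minutes))
```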
  • 18
    Qwen3-TTS

    Alibaba

    Qwen3-TTS is an open source series of advanced text-to-speech models developed by the Qwen team at Alibaba Cloud under the Apache-2.0 license, offering stable, expressive, and real-time speech generation with features such as voice cloning, voice design, and fine-grained control of prosody and acoustic attributes. The models support 10 major languages, including Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian, and multiple dialectal voice profiles with adaptive control over tone, speaking rate, and emotional expression based on text semantics and instructions. Qwen3-TTS uses efficient tokenization and a dual-track architecture that enables ultra-low-latency streaming synthesis (first audio packet in ~97 ms), making it suitable for interactive and real-time use cases, and includes a range of models with different capabilities (e.g., rapid 3-second voice cloning, custom voice timbres, and instruction-based voice design).
    Starting Price: Free
  • 19
    Gemini 2.5 Flash TTS
    Gemini 2.5 Flash TTS is the latest text-to-speech (TTS) model variant in Google’s Gemini 2.5 lineup, designed for faster, low-latency speech synthesis with expressive, controllable audio output. It offers significant enhancements in tone versatility and expressivity so that developers can generate speech that better matches style prompts, from storytelling narrations to character voices, with more natural emotional range. It features precision pacing, which allows it to adjust speech tempo based on context, delivering faster sections or slowing for emphasis more accurately according to instructions. It also supports multi-speaker dialogues with consistent character voices for scenarios like podcasts, interviews, or conversational agents, and improved multilingual handling so each speaker’s unique tone and style persist across languages. Gemini 2.5 Flash TTS is optimized for lower latency, making it ideal for interactive applications and real-time voice interfaces.
  • 20
    Charactr

    Charactr

    Powered by our state-of-the-art WaveThruVec model, transform text into expressive AI-generated speech with TTS or convert existing or new voice recordings into an AI-generated voice with Voice to Voice conversion. From photo-realistic to pixel art and everything in between, generate incredible animated and talking virtual characters that can easily be integrated into your app, game, website, or media project with our upcoming Visual and Motion API. Our API includes a state-of-the-art selection of male, female, and unique synthetic character voices that can be used to add natural and expressive speech to your app, game, or project.
  • 21
    Affect Lab

    Affect Lab

    Tech-driven consumer insights platform for Insights teams. Map insights across media, digital and shopper touchpoints, deliver customer experiences that resonate emotionally, optimize customer journey for increased conversions, gain emotion, attention, engagement and noticeability insights. Usability testing and analytics platform for UX teams. Measure attention, engagement and emotion across user journeys, test prototypes, mockups, websites, apps and chatbots, identify key elements within the UI that customers notice, deliver emotionally optimized UX and drive conversions. Emotion Insights to create the best customer experiences. Facial Coding APIs to measure emotional response at scale, single face emotion recognition, in-the-wild multi face emotion recognition, recorded video emotion analysis. Test stimuli of various modes and channels like videos, print ads, planograms, package designs, websites, apps, chatbots, etc.
  • 22
    IBM Watson Tone Analyzer
    The IBM Watson® Tone Analyzer uses linguistic analysis to detect emotional and language tones in written text. Watson Tone Analyzer can analyze tone at both the document and sentence levels. You can use the service to understand how your written communications are perceived and then to improve the tone of your communications. Businesses can use the service to learn the tone of their customers' communications and to respond to each customer appropriately, or to understand and improve their customer conversations. In this tutorial, you will learn how to use IBM Cloud Functions and cognitive and data services to build a serverless back end for a mobile application. Analyze emotions and tones in what people write online, like tweets or reviews. Predict whether they are happy, sad, confident, and more. Enable your chatbot to detect customer tones so you can build dialog strategies to adjust the conversation accordingly.
  • 23
    D-ID

    D-ID

    D-ID is a cutting-edge technology company specializing in generative AI and synthetic media, best known for its innovative Creative Reality Studio. This platform allows users to transform text, images, and audio into photorealistic videos featuring lifelike digital humans with natural facial expressions, speech, and movements. By combining deep learning, computer vision, and advanced AI models, D-ID empowers businesses, educators, and content creators to produce personalized, interactive video content at scale. The Creative Reality Studio enables users to generate talking avatars from static images, making it a popular tool for e-learning, marketing, entertainment, and customer service. Committed to privacy and ethical AI use, D-ID also incorporates facial anonymization technology, ensuring secure and responsible handling of visual data.
    Starting Price: $5.90 per month
  • 24
    Receptiviti

    Receptiviti

    Use language to reveal personality traits and drives. Receptiviti maps personalities to the Big Five personality framework. It includes a total of 35 different measures of personality. Understand how people think and behave in social settings by measuring their authenticity, clout, self-focus, affiliation, and more. Understand what is driving a person's behaviour, whether they are driven by the need for achievement and self actualization, domination, reward, avoidance of risk or by engaging in risk-seeking behaviour. Detect abusive or threatening language that expresses prejudice, violence against a particular group on the basis of race, religion, or sexual orientation and more. Determine the author of your text of interest. This tool is especially useful for literary research, cybersecurity, forensics, and social media analysis.
  • 25
    Imentiv AI

    Imentiv AI

    Are you looking to create truly emotionally engaging content? Look no further than Imentiv AI's advanced Emotion AI tool. Our machine learning models analyze the emotions of actors in your videos, providing deep insights into the emotional impact of your content. By understanding the emotions conveyed by your actors and story, you can anticipate how your audience will perceive your content. With Imentiv AI's video emotion analysis solution, you can create content that truly resonates with your viewers, capturing their hearts and minds. Analyze emotions accurately in the video and understand heuristics and biases in your video with the expertise of our trained psychologists. Enhance audience engagement and maximize ROI by analyzing ads, videos, and content with AI. Save time and effort by using AI for emotional impact analysis instead of running lengthy and expensive audience surveys.
    Starting Price: $19 per month
  • 26
    EmoVu

    Eyeris

    Using advanced artificial intelligence and machine learning, EmoVu understands human emotions. The EmoVu portal allows accurate measurement of video content's emotional engagement and effectiveness on target audiences. We invite both short and long-form video content owners to distribute ready-to-test creative to thousands of emotive viewers through our easy-to-use platform. Gauge messaging resonance and emotional connection to your creative, either for particular scenes or for the overall video, before content debuts. Maximize emotional engagement and avoid wasting budget on poor content. Use immediately after distribution to track early signs of engagement, social effect, content virality potential, and individual media outlet performance. Maximize content buzz and allocate smart budgets for campaign retargeting. Emotional campaigns are twice as likely as rational ones to generate large profit gains.
  • 27
    ElevenLabs

    ElevenLabs

    The most realistic and versatile AI speech software, ever. Eleven brings the most compelling, rich and lifelike voices to creators and publishers seeking the ultimate tools for storytelling. Generate top-quality spoken audio in any voice and style with the most advanced and multipurpose AI speech tool out there. Our deep learning model renders human intonation and inflections with unprecedented fidelity and adjusts delivery based on context. Our AI model is built to grasp the logic and emotions behind words. And rather than generate sentences one-by-one, it’s always mindful of how each utterance ties to preceding and succeeding text. This zoomed-out perspective allows it to intonate longer fragments convincingly and with purpose. And finally you can do this with any voice you want.
    Starting Price: $1 per month
  • 28
    Allganize

    Allganize

    Allganize's industry-leading AI solutions provide businesses with the best tool to automate customer and employee support. Automate an average of 72% of all monthly support tickets within the first 4 months of implementation. Let our AI automate simple customer requests and free up your agents’ time to handle more complex issues. Employees can ask questions in a conversational way and find answers from multiple document types. A conversational AI chatbot, pre-trained for your websites, automates customer service. Intelligent search extracts accurate answers from any document, instantaneously. Automatically extracts important keywords from any document and categorizes them, providing valuable insights. Understands the context of product reviews written in natural language to automatically detect positive or negative experiences. Assigns predefined categories from customer support conversations to accurately determine user intent.
    Starting Price: $2 per month
  • 29
    MeaningCloud

    MeaningCloud

    MeaningCloud is the easiest, most powerful, and most affordable way to extract the meaning from unstructured content: documents, articles, social conversations, web content, etc. We provide text analytics products to extract the most accurate insights from any content in many languages, and we offer them both as SaaS and on-premises. We work with different industries (pharma, finance, media, retail, hospitality, telco, etc.), developing personalized and industry-oriented solutions. Pay only for what you use, without any activation fees or minimum time commitment, and with the most generous free plan on the market. If you don't like it, you can stop using it, just like that. There is no software to install or infrastructure to deploy. You get all the reliability and scalability of cloud solutions, and the possibility of testing it for free.
    Starting Price: $99 per month
  • 30
    Good Vibrations Company (GVC)

    Good Vibrations Company

    In many GVC applications the first step of the process is emotion recognition: the user speaks for a few seconds, and the GVC Emotion Recognition algorithm measures hundreds of acoustic properties of the user’s voice and distills from these cues an assessment of the user’s emotional state. We can feed the results from our emotion recognition algorithm into algorithms that choose an appropriate feedback to the user. As GVC we are primarily interested in kinds of feedback that improve the user’s performance and quality of life. Measuring the signals provided by the user’s voice, heart, lungs or other organs. The GVC Concept has been implemented in several demo apps. These employ a suite of proprietary algorithms that analyse many facets of the user’s speech, such as the GVC Emotion Recognition and GVC Voice Disorder Detection algorithms.
  • 31
    Vokaturi

    Vokaturi

    The Vokaturi software reflects the state of the art in emotion recognition from the human voice. Its algorithms have been designed, and are continually improved, by Paul Boersma, professor of Phonetic Sciences at the University of Amsterdam, who is the main author of the world’s leading speech analysis software Praat. Vokaturi can measure directly from your voice whether you are happy, sad, afraid, angry, or have a neutral state of mind. Currently the open-source version of the software chooses between these five emotions with high accuracy, even if it hears the speaker for the first time. The "plus" version of the software reaches the performance level of a dedicated human listener. As a developer you can easily include the Vokaturi software as a library in your own applications. You can choose between a free open-source license and a paid license.
  • 32
    MARS6

    CAMB.AI

    CAMB.AI's MARS6 is a groundbreaking text-to-speech (TTS) model that has become the first speech model accessible on Amazon Web Services (AWS) Bedrock platform. This integration allows developers to incorporate advanced TTS capabilities into generative AI applications, facilitating the creation of enhanced voice assistants, engaging audiobooks, interactive media, and various audio-centric experiences. MARS6's advanced algorithms enable natural and expressive speech synthesis, setting a new standard for TTS conversion. Developers can access MARS6 directly through the Amazon Bedrock platform, ensuring seamless integration into applications and enhancing user engagement and accessibility. The inclusion of MARS6 in AWS Bedrock's diverse selection of foundation models underscores CAMB.AI's commitment to advancing machine learning and artificial intelligence, providing developers with vital tools to create rich audio experiences supported by AWS's reliable and scalable infrastructure.
  • 33
    Behavioral Signals

    Behavioral Signals

    We are at the forefront of human communication in a groundbreaking era. Driven by cutting-edge AI technology, we go beyond words, diving deep into the intricacies of human expression. Understanding emotions, assessing behaviors, and predicting intent, we unlock the essence of every interaction. Our transformative impact spans various industries, from strengthening security and defense operations to redefining contact centers and empowering financial institutions with invaluable insights. With our innovative approach, we reshape the way connections are made and understood, ushering in a new era of communication. Our core technology is provided via our Behavioral Signals API, which predicts low-level and behavioral voice characteristics from audio signals. Applications include customer service; security, intelligence, and law enforcement; cognitive and mental health; digital companions and chatbots; healthcare; and entertainment.
  • 34
    Chirp 3

    Chirp 3

    Google

    Google Cloud's Text-to-Speech API introduces Chirp 3, enabling users to create personalized voice models using their own high-quality audio recordings. This feature facilitates the rapid generation of custom voices, which can be utilized to synthesize audio through the Cloud Text-to-Speech API, supporting both streaming and long-form text. Access to this voice cloning capability is restricted to allow-listed users due to safety considerations; interested parties should contact the sales team to be added to the allowed list. Instant Custom Voice creation and synthesis are supported in various languages, including English (US), Spanish (US), and French (Canada), among others. It is available in multiple Google Cloud regions, and supported output formats include LINEAR16, OGG_OPUS, PCM, ALAW, MULAW, and MP3, depending on the API method used.
  • 35
    Affectiva

    Affectiva

    iMotions

    Affectiva, now part of the Smart Eye group, is a pioneering company in Emotion AI, dedicated to bridging the gap between humans and machines. Founded in 2009 by Dr. Rana el Kaliouby and Dr. Rosalind Picard, the company developed innovative technology to detect human emotions, cognitive states, and interactions. Affectiva’s Emotion AI is widely used in industries such as media analytics and automotive, with applications ranging from understanding consumer engagement to enhancing driver safety. The company’s cutting-edge technology is based on machine learning, computer vision, and real-world data annotation, all developed with a strong focus on ethical AI practices.
  • 36
    Orpheus TTS

    Orpheus TTS

    Canopy Labs

    Canopy Labs has introduced Orpheus, a family of state-of-the-art speech large language models (LLMs) designed for human-level speech generation. These models are built on the Llama-3 architecture and are trained on over 100,000 hours of English speech data, enabling them to produce natural intonation, emotion, and rhythm that surpass current state-of-the-art closed-source models. Orpheus supports zero-shot voice cloning, allowing users to replicate voices without prior fine-tuning, and offers guided emotion and intonation control through simple tags. The models achieve low latency, with approximately 200ms streaming latency for real-time applications, reducible to around 100ms with input streaming. Canopy Labs has released both pre-trained and fine-tuned 3B-parameter models under the permissive Apache 2.0 license, with plans to release smaller models of 1B, 400M, and 150M parameters for use on resource-constrained devices.
  • 37
    Raven-1

    Raven-1

    Tavus

    Raven-1 is a multimodal, real-time perceptual AI model from Tavus designed to bring emotional intelligence to artificial intelligence by interpreting human audio, visual, and temporal signals together instead of reducing communication to text alone. It unifies tone, facial expression, body language, hesitation, and contextual dynamics into a rich, unified representation of user intent and state, enabling conversational AI to understand how people communicate in real time with nuanced natural language descriptions rather than static emotion labels. It was engineered to overcome the limitations of traditional systems that rely on transcripts and limited emotion scoring by capturing subtle cues, such as emphasis, sarcasm, engagement shifts, and evolving emotional arcs, and continuously updating this understanding with low latency so responses align with the true context of the interaction.
    Starting Price: $59 per month
  • 38
    CoolTool

    CoolTool

    CoolTool

    Discover and validate what consumers see, think, and feel beyond their conscious control on desktop and mobile. Online webcam eye tracking for identifying where consumers’ attention goes. Online emotion measurement for capturing consumers’ emotional statements during the interaction with digital products. Online implicit tests for uncovering true attitudes and thoughts hidden in the subconscious. Online solutions for remote usability research. We’ve designed the new product UXReality that completely replaces usability labs with a virtual solution. It allows you to do UX research both on desktop and mobile remotely. You get high-quality session recordings and can literally see through the users’ eyes. The solution includes AI-powered webcam eye tracking, emotion measurement, and surveys.
  • 39
    Tobii Pro Sticky
    Sticky by Tobii Pro is a self-service online platform that combines online survey questions with webcam eye tracking and emotion recognition, making advanced quantitative research simple. With this time- and cost-efficient method of integrating eye tracking into your research, you can test large panels of consumers as they view targeted shelves, packs, ads, or webpages from their own computers. For a fraction of the cost of traditional in-person research, the online platform Sticky by Tobii Pro enables large-scale quantitative eye tracking research and emotion recognition as a self-service. Using the participant’s webcam, market researchers gain valuable visual and emotional data on the impact and performance of existing or new designs and placements, for example, in packaging and advertising research. The platform is integrated with online survey engines and panel companies with global reach, enabling distributed data collection with a quick turnaround time.
  • 40
    EyeRecognize

    EyeRecognize

    EyeRecognize

    Our image and video recognition APIs are proven, highly scalable, and leverage deep learning technology that you can implement within your own applications without prior machine learning expertise. EyeRecognize’s suite of image and video recognition API services allows you to identify objects, people, text, scenes, and activities in images and videos, as well as detect faces and NSFW content. Face Detection and Analysis: detect all faces in images and video and get attributes such as face location, gender, age, eyes, and even emotion. Text Detection: extract text from images such as license plates, street signs, advertising, and brand names. Identify NSFW ("Not Safe for Work") and other potentially inappropriate content across both image and video. The team behind EyeRecognize has been collectively developing artificial intelligence powered applications for over 40 years and first pioneered the use of machine learning to automate content moderation for social media.
  • 41
    Azure Text to Speech
    Build apps and services that speak naturally. Differentiate your brand with a customized, realistic voice generator, and access voices with different speaking styles and emotional tones to fit your use case—from text readers and talkers to customer support chatbots. Enable fluid, natural-sounding text to speech that matches the intonation and emotion of human voices. Tune voice output for your scenarios by easily adjusting rate, pitch, pronunciation, pauses, and more. Engage global audiences by using 400 neural voices across 140 languages and variants. Bring your scenarios like text readers and voice-enabled assistants to life with highly expressive and human-like voices. Neural Text to Speech supports several speaking styles including newscast, customer service, shouting, whispering, and emotions like cheerful and sad.
  • 42
    Emozo

    Emozo

    Emozo Labs

    Emozo’s DIY SaaS Research & Feedback Collection platform uses behavioral and emotional insights to help you drive the right decisions for all digital content. Emozo’s platform helps you go beyond traditional customer data analytics and delve into customers’ hearts and minds to understand the effectiveness and impact of all digital content. You can use Emozo to test the effectiveness of ads, applications, streaming media content, and the like, on any channel – web, mobile, social media, TV, etc. Emozo’s novel method of combining unconscious (attention and emotion) and stated (survey) responses helps you understand the effectiveness of all digital content very quickly. Emozo leverages AI to enable qualitative research at scale and with speed on customers' devices. Emozo supports iterative design-development processes and offers fully secure data protection for you and your customers.
    Starting Price: $750 per month
  • 43
    BlueML

    BlueML

    Explorance

    Get an in-depth analysis of your open text comments in seconds with Blue Machine Learning (BlueML) solutions. Now you can see what matters most to your students and employees and instantly get more actionable insights to streamline your decisions. Most comment analysis tools use a generic one-size-fits-all approach usually based on customer experience machine learning models. However, the employee and student journeys are made up of specific components around experience and learning. With BlueML, you can leverage three specialized models that will accurately consume and analyze comments from each area along the student and employee journeys, giving you context-specific categorization. Get an accurate view of the overall sentiments in employee and student comments (very negative, negative, neutral, positive, very positive, ambiguous). Gain insights about what emotions employees and students have expressed in their comments.
  • 44
    Amazon Nova Sonic
    Amazon Nova Sonic is a state-of-the-art speech-to-speech model that delivers real-time, human-like voice conversations with industry-leading price performance. It unifies speech understanding and generation into a single model, enabling developers to create natural, expressive conversational AI experiences with low latency. Nova Sonic adapts its responses based on the prosody of input speech, such as pace and timbre, resulting in more natural dialogue. It supports function calling and agentic workflows to interact with external services and APIs, including knowledge grounding with enterprise data using Retrieval-Augmented Generation (RAG). It provides robust speech understanding for American and British English across various speaking styles and acoustic conditions, with additional languages coming soon. Nova Sonic handles user interruptions gracefully without dropping conversational context and is robust to background noise.
  • 45
    GPT-Image-1
    OpenAI's Image Generation API, powered by the gpt-image-1 model, enables developers and businesses to integrate high-quality, professional-grade image generation directly into their tools and platforms. This model offers versatility, allowing it to create images across diverse styles, faithfully follow custom guidelines, leverage world knowledge, and accurately render text, unlocking countless practical applications across multiple domains. Leading enterprises and startups across industries, including creative tools, ecommerce, education, enterprise software, and gaming, are already using image generation in their products and experiences. It gives creators the choice and flexibility to experiment with different aesthetic styles. Users can generate and edit images from simple prompts, adjusting styles, adding or removing objects, expanding backgrounds, and more.
    Starting Price: $0.19 per image
  • 46
    Amazon Nova 2 Sonic
    Nova 2 Sonic is Amazon’s real-time speech-to-speech model designed to deliver natural, flowing voice interactions without relying on separate systems for text and audio. It combines speech recognition, speech generation, and text processing in a single model, enabling smooth, human-like conversations that can shift effortlessly between voice and text. With expanded multilingual support and expressive voice options, it produces responses that sound more lifelike and contextually aware. Its one-million-token context window allows for long, continuous interactions without losing track of prior details. It supports asynchronous task handling, meaning users can continue speaking, change topics, or ask follow-up questions while background tasks, such as searching for information or completing a request, continue uninterrupted. This makes voice experiences feel more fluid and less bound by traditional turn-based dialog constraints.
  • 47
    alwaysAI

    alwaysAI

    alwaysAI

    alwaysAI provides developers with a simple and flexible way to build, train, and deploy computer vision applications to a wide variety of IoT devices. Select from a catalog of deep learning models or upload your own. Use our flexible and customizable APIs to quickly enable core computer vision services. Quickly prototype, test and iterate with a variety of camera-enabled ARM-32, ARM-64 and x86 devices. Identify objects in an image by name or classification. Identify and count objects appearing in a real-time video feed. Follow the same object across a series of frames. Find faces or full bodies in a scene to count or track. Locate and define borders around separate objects. Separate key objects in an image from background visuals. Detect human body poses, falls, and emotions. Use our model training toolkit to train an object detection model to identify virtually any object. Create a model tailored to your specific use-case.
  • 48
    Gemini 2.5 Flash Native Audio
    Google has released updated Gemini audio models that significantly expand the platform’s capabilities for natural, expressive voice interactions and real-time conversational AI with the introduction of Gemini 2.5 Flash Native Audio and improved text-to-speech technology. The updated native audio model powers live voice agents that can handle complex workflows, follow detailed user instructions more reliably, and maintain smoother multi-turn conversations by better recalling context from previous turns. It is now available across Google AI Studio, Vertex AI, Gemini Live, and Search Live, enabling developers and products to build interactive voice experiences such as intelligent assistants and enterprise voice agents. In addition to the real-time voice improvements, Google enhanced the underlying Text-to-Speech (TTS) models in the Gemini 2.5 family to offer greater expressivity, tone control, pacing adjustments, and multilingual support, so synthesized speech feels more natural.
  • 49
    FineVoice

    FineVoice

    FineVoice

    FineVoice is an AI-powered voice generation platform designed to create realistic, expressive, human-like speech in seconds. It offers access to over 1,500 AI voices across 154 languages and accents for global content creation. FineVoice supports text-to-speech, voice cloning, voice changing, sound effects, and background music generation in one platform. Users can precisely control emotion, tone, speed, and style to produce natural and engaging audio. The platform is built for creators, educators, and businesses needing professional-quality voiceovers. FineVoice enables fast production for videos, podcasts, e-learning, and advertising. Its intuitive interface makes advanced AI voice technology accessible without technical expertise.
    Starting Price: $5.99 per month
  • 50
    Novita AI

    Novita AI

    novita.ai

    Explore the full spectrum of AI APIs tailored for image, video, audio, and LLM applications. Novita AI is designed to elevate your AI-driven business at the pace of technology, offering model hosting and training solutions. Access 100+ APIs, including AI image generation & editing with 10,000+ models, and training APIs for custom models. Enjoy the cheapest pay-as-you-go pricing, freeing you from GPU maintenance hassles while building your own products. Generate images in 2 seconds from 10,000+ models with a single click. Models are kept up to date with Civitai and Hugging Face. A wide variety of products are built on the Novita API, and you can empower your own products with a quick Novita API integration.
    Starting Price: $0.0015 per image