The Transformer architecture has improved the performance of deep learning models in domains such as Computer Vision and Natural Language Processing. Better performance has come with larger model sizes, and these larger models run up against the memory wall of current accelerator hardware such as GPUs. Training large models such as Vision Transformer, BERT, and GPT on a single GPU or a single machine is never ideal, so there is an urgent demand for training in a distributed environment. However, distributed training, and model parallelism in particular, often requires domain expertise in computer systems and architecture, and implementing complex distributed training solutions remains a challenge for AI researchers. Colossal-AI provides a collection of parallel components for you. We aim to let you write distributed deep learning models the same way you write a model on your laptop.

Features

  • Heterogeneous Memory Management
  • 24x larger model size on the same hardware
  • Pull from DockerHub
  • Build On Your Own
  • Parallelism strategies
  • Parallelism based on configuration file
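
To illustrate what configuration-driven parallelism can look like, here is a minimal sketch in plain Python. It is not Colossal-AI's API: the `ParallelLayout` class, the `resolve_layout` function, and the config keys (`parallel`, `pipeline`, `tensor`, `size`) are hypothetical names chosen for this example. The sketch assumes the common convention that the devices left over after pipeline and tensor parallelism are used for data parallelism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelLayout:
    """Size of each parallel dimension for one training job (hypothetical type)."""
    data: int
    pipeline: int
    tensor: int


def resolve_layout(config: dict, world_size: int) -> ParallelLayout:
    """Derive the data-parallel size from a config dict and the device count.

    Assumption: world_size = data * pipeline * tensor, so the data-parallel
    size is whatever remains after the model-parallel dimensions are fixed.
    """
    parallel = config.get("parallel", {})
    pipeline = int(parallel.get("pipeline", 1))
    tensor = int(parallel.get("tensor", {}).get("size", 1))
    if pipeline < 1 or tensor < 1:
        raise ValueError("parallel sizes must be positive integers")

    model_parallel = pipeline * tensor
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by "
            f"pipeline * tensor = {model_parallel}"
        )
    return ParallelLayout(
        data=world_size // model_parallel,
        pipeline=pipeline,
        tensor=tensor,
    )


# Hypothetical config: 2 pipeline stages, tensor parallelism across 4 devices.
CONFIG = {"parallel": {"pipeline": 2, "tensor": {"size": 4}}}

layout = resolve_layout(CONFIG, world_size=16)
print(layout)  # ParallelLayout(data=2, pipeline=2, tensor=4)
```

With this kind of layout, changing the parallelism strategy means editing the config rather than the model code, which is the idea behind the configuration-file feature listed above.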

License

Apache License V2.0
