Join me as I share my journey from working with Convolutional Neural Networks (CNNs) and multi-modal systems before the era of Large Language Models (LLMs) to my recent experience at Hugging Face. With fresh eyes, I've explored and benefited from remarkable tools that weren't available just a few years ago, and I'm excited to share these insights with you. We'll delve into multi-modal video LLMs: the state of the art, their limitations, and the tools and datasets that are shaping the future of the field.
In this session, you'll learn the latest approaches for preparing data and training video LLMs, with a focus on best practices for data processing, model architecture, and training techniques. We'll reinforce these ideas with an in-depth look at models and tools that have successfully implemented these methods.