Deep Learning Practice
A practitioner's guide to state-of-the-art AI, covering the Hugging Face ecosystem, Large Language Model tuning, Speech Processing, and Computer Vision challenges.
This hands-on course bridges the gap between deep learning theory and industrial application. We begin with a deep dive into the Hugging Face ecosystem, mastering data streaming and custom tokenization pipelines. The curriculum then tackles advanced training paradigms for Large Language Models, including Parameter-Efficient Fine-Tuning (PEFT) via LoRA and QLoRA, and memory-efficient strategies such as Gradient Accumulation and Mixed Precision training. The latter half of the course shifts to specialized domains, solving real-world challenges in Speech (Diarization, TTS/STT) and Computer Vision, ranging from mosquito detection with YOLO to medical image super-resolution with SRGANs.
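The core idea behind LoRA, mentioned above, can be sketched in a few lines: a frozen pretrained weight matrix is augmented with a trainable low-rank correction, scaled by `alpha / r`. The shapes, names, and initialization below are illustrative assumptions, not course code; the zero-initialized up-projection `B` is the standard trick that makes the adapted model start out identical to the base model.

```python
import numpy as np

# Illustrative sizes (assumed, not from the course): 8-dim features, rank-2 adapter.
rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 8, 8, 2, 4

W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                # trainable up-projection, zero-init

def lora_forward(x):
    # Base output plus the low-rank correction, scaled by alpha / r.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B zero-initialized, the adapter contributes nothing at the start of training.
assert np.allclose(lora_forward(x), W @ x)
```

Only `A` and `B` (2 × r × d parameters) would be updated during fine-tuning, which is why LoRA fits large models into modest GPU memory.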
Instructors
- Prof. Mitesh M. Khapra, Dept. of CS & Engineering, IIT Madras
- Prof. S. Umesh, Dept. of Electrical Engineering, IIT Madras
- Dr. Kaushik Mitra, Dept. of Electrical Engineering, IIT Madras
Course Schedule & Topics
The course is structured over 12 weeks, moving from NLP foundations to advanced Speech and Vision applications.
| Week | Primary Focus | Key Topics Covered |
|---|---|---|
| 1 | Modern NLP & Hugging Face | Transformers intro, HF Ecosystem (Datasets, Tokenizers), and Dataset Streaming. |
| 2 | Tokenization Pipelines | Normalization, Pre-tokenization, and training custom Tokenizer algorithms. |
| 3 | Downstream Fine-tuning | Task-specific heads, freezing parameters, and full-parameter fine-tuning. |
| 4 | Advanced LLM Training | Continual pre-training, PEFT (LoRA/QLoRA), and Memory-efficient optimization. |
| 5 | Speech: Identification | Spoken Language Identification (SLI) techniques and models. |
| 6 | Speech: Diarization | Identifying “who spoke when” in multi-speaker conversational data. |
| 7 | Speech: Synthesis | Architectures for Speech-to-Text (STT) and Text-to-Speech (TTS) synthesis. |
| 8 | Speech: Wake Word | Personalization and detection for “Hey Google” or “Alexa” style triggers. |
| 9 | CV: Image Classification | AlexNet, VGG, ResNet, and Vision Transformers (ViT) on imbalanced datasets. |
| 10 | CV: Object Detection | YOLO and RCNN applied to specialized tasks like mosquito recognition. |
| 11 | CV: Depth Estimation | UNet, UNet++, and Pix2Pix models for low-light environment challenges. |
| 12 | CV: Super-resolution | Using SRGANs for medical image enhancement without pre-trained baselines. |
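The gradient accumulation covered in Week 4 can be sketched with a toy scalar model (the model, learning rate, and data below are invented for illustration): gradients from several micro-batches are summed, and the optimizer steps once with their average, simulating a larger effective batch size without the memory cost.

```python
# Toy sketch of gradient accumulation: fit w in y = w * x to data generated
# with w = 2, using squared loss, stepping only every `accum_steps` micro-batches.
def grad(w, x, y):
    # d/dw (w*x - y)^2 = 2 * (w*x - y) * x
    return 2.0 * (w * x - y) * x

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w, lr, accum_steps = 0.0, 0.01, 2

acc = 0.0
for i, (x, y) in enumerate(data, start=1):
    acc += grad(w, x, y)              # accumulate instead of stepping immediately
    if i % accum_steps == 0:
        w -= lr * acc / accum_steps   # one step with the averaged gradient
        acc = 0.0
```

In a real framework the same pattern appears as calling `backward()` on each micro-batch and invoking the optimizer step only every `accum_steps` iterations.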
Materials used
- All learning materials, code examples, and case studies were provided through the course portal.
- Hands-on competitions hosted on Kaggle.