Deep Learning Practice
A practitioner's guide to state-of-the-art AI, covering the Hugging Face ecosystem, Large Language Model tuning, Speech Processing, and Computer Vision challenges.
This hands-on course bridges the gap between deep learning theory and industrial application. We begin with a deep dive into the Hugging Face ecosystem, mastering data streaming and custom tokenization pipelines. The curriculum then tackles advanced training paradigms for Large Language Models, including Parameter-Efficient Fine-Tuning (PEFT) via LoRA and QLoRA, and memory-efficient strategies such as Gradient Accumulation and Mixed Precision training. The latter half of the course shifts to specialized domains, solving real-world challenges in Speech (Diarization, TTS/STT) and Computer Vision, ranging from mosquito detection with YOLO to medical image super-resolution with SRGANs.
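The core idea behind LoRA, mentioned above, can be sketched in a few lines: a frozen pretrained weight matrix is augmented with a trainable low-rank correction, scaled by `alpha / r`. The shapes, names, and initialization below are illustrative assumptions, not course code; the zero-initialized up-projection `B` is the standard trick that makes the adapted model start out identical to the base model.

```python
import numpy as np

# Illustrative sizes (assumed, not from the course): 8-dim features, rank-2 adapter.
rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 8, 8, 2, 4

W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                # trainable up-projection, zero-init

def lora_forward(x):
    # Base output plus the low-rank correction, scaled by alpha / r.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B zero-initialized, the adapter contributes nothing at the start of training.
assert np.allclose(lora_forward(x), W @ x)
```

Only `A` and `B` (2 × r × d parameters) would be updated during fine-tuning, which is why LoRA fits large models into modest GPU memory.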
Instructors
- Prof. Mitesh M. Khapra, Dept. of CS & Engineering, IIT Madras
- Prof. S. Umesh, Dept. of Electrical Engineering, IIT Madras
- Dr. Kaushik Mitra, Dept. of Electrical Engineering, IIT Madras
Course Schedule & Topics
The course is structured over 12 weeks, moving from NLP foundations to advanced Speech and Vision applications.
| Week | Primary Focus | Key Topics Covered |
|---|---|---|
| 1 | Modern NLP & Hugging Face | Transformers intro, HF Ecosystem (Datasets, Tokenizers), and Dataset Streaming. |
| 2 | Tokenization Pipelines | Normalization, Pre-tokenization, and training custom Tokenizer algorithms. |
| 3 | Downstream Fine-tuning | Task-specific heads, freezing parameters, and full-parameter fine-tuning. |
| 4 | Advanced LLM Training | Continual pre-training, PEFT (LoRA/QLoRA), and Memory-efficient optimization. |
| 5 | Speech: Identification | Spoken Language Identification (SLI) techniques and models. |
| 6 | Speech: Diarization | Identifying “who spoke when” in multi-speaker conversational data. |
| 7 | Speech: Synthesis | Architectures for Speech-to-Text (STT) and Text-to-Speech (TTS) synthesis. |
| 8 | Speech: Wake Word | Personalization and detection for “Hey Google” or “Alexa” style triggers. |
| 9 | CV: Image Classification | AlexNet, VGG, ResNet, and Vision Transformers (ViT) on imbalanced datasets. |
| 10 | CV: Object Detection | YOLO and RCNN applied to specialized tasks like mosquito recognition. |
| 11 | CV: Depth Estimation | UNet, UNet++, and Pix2Pix models for low-light environment challenges. |
| 12 | CV: Super-resolution | Using SRGANs for medical image enhancement without pre-trained baselines. |
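The gradient accumulation covered in Week 4 can be sketched with a toy scalar model (the model, learning rate, and data below are invented for illustration): gradients from several micro-batches are summed, and the optimizer steps once with their average, simulating a larger effective batch size without the memory cost.

```python
# Toy sketch of gradient accumulation: fit w in y = w * x to data generated
# with w = 2, using squared loss, stepping only every `accum_steps` micro-batches.
def grad(w, x, y):
    # d/dw (w*x - y)^2 = 2 * (w*x - y) * x
    return 2.0 * (w * x - y) * x

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w, lr, accum_steps = 0.0, 0.01, 2

acc = 0.0
for i, (x, y) in enumerate(data, start=1):
    acc += grad(w, x, y)              # accumulate instead of stepping immediately
    if i % accum_steps == 0:
        w -= lr * acc / accum_steps   # one step with the averaged gradient
        acc = 0.0
```

In a real framework the same pattern appears as calling `backward()` on each micro-batch and invoking the optimizer step only every `accum_steps` iterations.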
Materials used
- All learning materials, code examples, and case studies were provided through the course portal.
- Hands-on competitions hosted on Kaggle.