Deep Learning Practice

A practitioner's guide to state-of-the-art AI, covering the Hugging Face ecosystem, Large Language Model tuning, Speech Processing, and Computer Vision challenges.

This experiential course bridges the gap between deep learning theory and industrial application. We begin with a deep dive into the Hugging Face ecosystem, mastering dataset streaming and custom tokenization pipelines. The curriculum then tackles advanced training paradigms for Large Language Models, including PEFT (Parameter-Efficient Fine-Tuning) via LoRA and QLoRA, and memory-efficient strategies such as Gradient Accumulation and Mixed Precision training. The latter half of the course shifts to specialized domains, solving real-world challenges in Speech (Diarization, TTS/STT) and Computer Vision, ranging from mosquito detection with YOLO to medical image super-resolution with SRGANs.
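To make two of the LLM training ideas above concrete, here is a minimal, dependency-free Python sketch that uses plain lists instead of tensors, so it is a toy rather than how a framework implements them. The LoRA part assumes the usual formulation: a frozen weight W is augmented by a trainable low-rank product B A scaled by alpha / r, with B initialized to zero (the 0.02 initialization scale for A is an arbitrary choice). The gradient-accumulation part uses a one-parameter linear model with mean-squared-error loss as a stand-in for a real network. All function names are hypothetical.

```python
import random


def matvec(m, v):
    # Multiply a matrix (list of rows) by a vector (list).
    return [sum(a * b for a, b in zip(row, v)) for row in m]


# --- LoRA: frozen W plus a trainable low-rank update B @ A ---

def init_lora(d_out, d_in, r, seed=0):
    # A (r x d_in) starts small and random; B (d_out x r) starts at zero,
    # so the update B @ A is zero before training.
    # The 0.02 std is an illustrative assumption.
    rng = random.Random(seed)
    a = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(r)]
    b = [[0.0] * r for _ in range(d_out)]
    return a, b


def lora_forward(w, a, b, x, alpha, r):
    # Output = W x + (alpha / r) * B (A x); only A and B would be trained.
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    scale = alpha / r
    return [y + scale * u for y, u in zip(base, update)]


def lora_param_counts(d_out, d_in, r):
    # Trainable parameters: full fine-tuning of W vs. LoRA's A and B.
    return d_out * d_in, r * (d_in + d_out)


# --- Gradient accumulation: several micro-batches per optimizer step ---

def mse_grad(w, xs, ys):
    # d/dw of mean((w * x - y)^2) over one batch.
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch, steps):
    # Sum micro-batch gradients, each scaled by 1 / steps, before one update.
    # In real training this is what lets only one micro-batch's activations
    # be held in memory at a time.
    assert len(xs) == micro_batch * steps
    grad = 0.0
    for i in range(steps):
        lo, hi = i * micro_batch, (i + 1) * micro_batch
        grad += mse_grad(w, xs[lo:hi], ys[lo:hi]) / steps
    return grad


print(lora_param_counts(4096, 4096, 8))  # → (16777216, 65536)
```

For a 4096 × 4096 projection at rank 8, LoRA trains 65,536 parameters instead of 16,777,216 (256× fewer), and because B starts at zero the adapted layer initially reproduces the frozen one exactly. With equal-size micro-batches, `accumulated_grad` matches the full-batch gradient from `mse_grad`, which is why accumulation can emulate a larger batch size.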


Instructors


Course Schedule & Topics

The course is structured over 12 weeks, moving from NLP foundations to advanced Speech and Vision applications.

Week | Primary Focus | Key Topics Covered
---- | ------------- | ------------------
1 | Modern NLP & Hugging Face | Introduction to Transformers, the HF ecosystem (Datasets, Tokenizers), and dataset streaming.
2 | Tokenization Pipelines | Normalization, pre-tokenization, and training custom tokenizer algorithms.
3 | Downstream Fine-tuning | Task-specific heads, parameter freezing, and full-parameter fine-tuning.
4 | Advanced LLM Training | Continual pre-training, PEFT (LoRA/QLoRA), and memory-efficient optimization.
5 | Speech: Identification | Spoken Language Identification (SLI) techniques and models.
6 | Speech: Diarization | Identifying “who spoke when” in multi-speaker conversational data.
7 | Speech: Synthesis | Architectures for Speech-to-Text (STT) recognition and Text-to-Speech (TTS) synthesis.
8 | Speech: Wake Word | Detection and personalization of “Hey Google”- or “Alexa”-style triggers.
9 | CV: Image Classification | AlexNet, VGG, ResNet, and Vision Transformers (ViT) on imbalanced datasets.
10 | CV: Object Detection | YOLO and R-CNN applied to specialized tasks such as mosquito recognition.
11 | CV: Depth Estimation | UNet, UNet++, and Pix2Pix models for low-light environments.
12 | CV: Super-resolution | SRGANs for medical image enhancement without pre-trained baselines.
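The Week 2 stages (normalization, pre-tokenization, then training a tokenizer model) can be sketched in plain Python. This is a toy illustration, not the Hugging Face `tokenizers` API, where these stages are separate components (normalizers, pre-tokenizers, models and trainers). The specific choices are assumptions: NFKD with accent stripping and lowercasing for normalization, splitting into words and punctuation for pre-tokenization, and byte-pair encoding (BPE) as the trained algorithm. Function names are hypothetical.

```python
import re
import unicodedata
from collections import Counter


def normalize(text):
    # Normalization: decompose accented characters (NFKD), drop the
    # combining marks, then lowercase.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def pre_tokenize(text):
    # Pre-tokenization: split into runs of word characters and single
    # punctuation marks; whitespace is discarded.
    return re.findall(r"\w+|[^\w\s]", text)


def train_bpe(words, num_merges):
    # Start with each word as a tuple of characters, counted by frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Most frequent pair; ties go to the pair seen first.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with that pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


print(pre_tokenize(normalize("Café Déjà Vu!")))  # → ['cafe', 'deja', 'vu', '!']
print(train_bpe(["low", "low", "lower", "lowest"], 2))  # → [('l', 'o'), ('lo', 'w')]
```

The learned merge list is the trained "model": applying the merges in order to new pre-tokenized words reproduces the same subword segmentation.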

Materials Used

  • All learning materials, code examples, and case studies were provided through the course portal.
  • Hands-on competitions on Kaggle.