Overview
Led the design and delivery of a comprehensive tutorial at the SciPy 2024 conference in Tacoma, WA, teaching scientists how to build AI-powered question-answering applications using entirely open-source tools and models. The tutorial was the eScience Institute's first foray into Generative AI education built on a fully open-source stack.
Tutorial Objectives
Enable scientists to:
- Build Retrieval-Augmented Generation (RAG) applications for research
- Use open-source Large Language Models (LLMs) like OLMo
- Create AI copilots tailored to their scientific domains
- Deploy and scale AI applications for research teams
Key Components
Module 1: Introduction to LLMs
Introduction to Large Language Models and their applications in scientific research.
Module 2: Vector Databases & Embeddings
Understanding how to represent and search scientific documents using vector embeddings.
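For illustration, a minimal sketch of the idea behind this module, assuming the sentence-transformers package and the open all-MiniLM-L6-v2 embedding model (the tutorial itself may have used different tooling): documents and queries are mapped to dense vectors, and relevance is scored by vector similarity.

```python
# Minimal sketch: embed a few documents and rank them against a query.
# Assumes sentence-transformers and the open "all-MiniLM-L6-v2" model,
# which are illustrative choices, not necessarily the tutorial's.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "Glaciers in Alaska have retreated measurably since 1980.",
    "The transformer architecture relies on self-attention.",
    "Ocean acidification affects shellfish populations.",
]
query = "How has climate change affected glaciers?"

# Encode documents and the query into dense vectors.
doc_vecs = model.encode(docs, normalize_embeddings=True)
query_vec = model.encode(query, normalize_embeddings=True)

# With normalized vectors, cosine similarity reduces to a dot product.
scores = doc_vecs @ query_vec
best = int(np.argmax(scores))
print(f"Best match (score={scores[best]:.2f}): {docs[best]}")
```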
Module 3: RAG Application Development
Building a complete Retrieval-Augmented Generation application that combines the following components (a minimal code sketch follows the list):
- Document ingestion and preprocessing
- Vector database for semantic search
- LLM integration for natural language responses
- Web interface for user interaction
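As a rough sketch of how these pieces connect, the example below wires together ingestion, a vector store, retrieval, and generation. It assumes LangChain's FAISS vector store with Hugging Face embeddings and uses a small stand-in generation model; the tutorial's actual code, built around OLMo, its own vector database, and a web interface, differs in detail.

```python
# Hedged RAG sketch: ingest text chunks, index them in a vector store,
# retrieve context for a question, and prompt an LLM with that context.
# Assumes langchain-community, faiss-cpu, sentence-transformers, and
# transformers are installed; model choices are placeholders.
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from transformers import pipeline

# 1. Document ingestion and preprocessing (chunked papers in practice).
chunks = [
    "OLMo is a fully open language model released by the Allen Institute for AI.",
    "Retrieval-Augmented Generation grounds model answers in retrieved documents.",
]

# 2. Vector database for semantic search.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
store = FAISS.from_texts(chunks, embeddings)

# 3. Retrieve context and build a grounded prompt.
question = "What is OLMo?"
context = "\n".join(d.page_content for d in store.similarity_search(question, k=2))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}\nAnswer:"

# 4. LLM integration (stand-in model here; the tutorial used OLMo).
generator = pipeline("text-generation", model="gpt2")
print(generator(prompt, max_new_tokens=64)[0]["generated_text"])
```

Assembling the prompt by hand rather than through a chain abstraction keeps each stage of the pipeline visible, which mirrors the step-by-step structure of the module.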
Technical Stack
- LLM: OLMo, a fully open-source language model (a loading sketch follows this list)
- Vector Database: Open-source vector storage and retrieval
- Python Libraries: LangChain, Hugging Face Transformers
- Infrastructure: JupyterHub for participant compute environments
- Deployment: Scalable cloud infrastructure for tutorial delivery
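As one illustration of the LLM piece of this stack, an OLMo checkpoint can be loaded directly through Hugging Face Transformers. The checkpoint ID below is one publicly released variant and is only an example of the kind of model used; a recent transformers release with OLMo support is assumed.

```python
# Sketch of loading an OLMo checkpoint with Hugging Face Transformers.
# "allenai/OLMo-1B-hf" is one released checkpoint; the tutorial may have
# used a different OLMo size or a dedicated inference server.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMo-1B-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Retrieval-Augmented Generation is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```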
Leadership Roles
As tutorial lead and lead instructor, I:
- Curriculum Design: Architected the overall learning experience and content flow
- Infrastructure Engineering: Set up the computing environment for 100+ participants
- Content Development: Created initial tutorial materials and code examples
- Instruction: Delivered Module 3 on RAG application development
- Technical Support: Ensured smooth execution during the live tutorial
Innovation & Impact
First Open-Source GenAI Tutorial
This tutorial broke new ground by using entirely open-source tools:
- No proprietary APIs (OpenAI, Anthropic, etc.)
- Fully reproducible stack
- No vendor lock-in
- Complete transparency in model behavior
Empowering Scientists
Enabled researchers to:
- Build domain-specific AI assistants
- Process and query large scientific literature corpora
- Maintain control over their data and models
- Understand AI capabilities and limitations
Infrastructure Challenges
Successfully deployed compute infrastructure to support the following (a configuration sketch appears after the list):
- Real-time LLM inference for multiple users
- Vector database operations at scale
- Interactive Jupyter environments
- Stable performance throughout the 3-hour tutorial
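The hypothetical jupyterhub_config.py fragment below sketches the kind of per-user limits involved in serving a large hands-on audience; the actual SciPy deployment (spawner, cloud provider, and exact limits) is not reproduced here.

```python
# jupyterhub_config.py sketch: generic settings for capping per-participant
# resources during a large live tutorial. Values are illustrative only.
c = get_config()  # noqa: F821  (injected by JupyterHub when loading config)

# Per-user resource caps (enforced by container-based spawners).
c.Spawner.mem_limit = "8G"
c.Spawner.cpu_limit = 2.0

# Throttle simultaneous server starts so 100+ logins at the start of the
# session do not overwhelm the cluster.
c.JupyterHub.concurrent_spawn_limit = 20
```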
Community Reception
The tutorial received excellent feedback from SciPy attendees, with particular appreciation for:
- Practical, hands-on approach
- Open-source commitment
- Scientific use case focus
- Comprehensive coverage of RAG pipeline
Future Directions
The tutorial materials serve as a foundation for:
- Ongoing GenAI education programs at eScience Institute
- Integration into hackweek curricula
- Best practices for scientific AI applications
- Community standards for responsible AI use in research