Overview
Led the design and delivery of a comprehensive tutorial at the SciPy 2024 conference in Tacoma, WA, teaching scientists how to build AI-powered question-answering applications using entirely open-source tools and models. The tutorial was the eScience Institute's first foray into Generative AI education built on a fully open-source stack.
Tutorial Objectives
Enable scientists to:
- Build Retrieval-Augmented Generation (RAG) applications for research
- Use open-source Large Language Models (LLMs) like OLMo
- Create AI copilots tailored to their scientific domains
- Deploy and scale AI applications for research teams
Key Components
Module 1: Introduction to LLMs
Introduction to Large Language Models and their applications in scientific research.
Module 2: Vector Databases & Embeddings
Understanding how to represent and search scientific documents using vector embeddings.
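For illustration, a minimal sketch of the idea behind this module, assuming the sentence-transformers package and the open all-MiniLM-L6-v2 embedding model (the tutorial itself may have used different tooling): documents and queries are mapped to dense vectors, and relevance is scored by vector similarity.

```python
# Minimal sketch: embed a few documents and rank them against a query.
# Assumes sentence-transformers and the open "all-MiniLM-L6-v2" model,
# which are illustrative choices, not necessarily the tutorial's.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "Glaciers in Alaska have retreated measurably since 1980.",
    "The transformer architecture relies on self-attention.",
    "Ocean acidification affects shellfish populations.",
]
query = "How has climate change affected glaciers?"

# Encode documents and the query into dense vectors.
doc_vecs = model.encode(docs, normalize_embeddings=True)
query_vec = model.encode(query, normalize_embeddings=True)

# With normalized vectors, cosine similarity reduces to a dot product.
scores = doc_vecs @ query_vec
best = int(np.argmax(scores))
print(f"Best match (score={scores[best]:.2f}): {docs[best]}")
```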
Module 3: RAG Application Development
Building a complete Retrieval-Augmented Generation application that combines the following components (a minimal code sketch follows the list):
- Document ingestion and preprocessing
- Vector database for semantic search
- LLM integration for natural language responses
- Web interface for user interaction
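As a rough sketch of how these pieces connect, the example below wires together ingestion, a vector store, retrieval, and generation. It assumes LangChain's FAISS vector store with Hugging Face embeddings and uses a small stand-in generation model; the tutorial's actual code, built around OLMo, its own vector database, and a web interface, differs in detail.

```python
# Hedged RAG sketch: ingest text chunks, index them in a vector store,
# retrieve context for a question, and prompt an LLM with that context.
# Assumes langchain-community, faiss-cpu, sentence-transformers, and
# transformers are installed; model choices are placeholders.
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from transformers import pipeline

# 1. Document ingestion and preprocessing (chunked papers in practice).
chunks = [
    "OLMo is a fully open language model released by the Allen Institute for AI.",
    "Retrieval-Augmented Generation grounds model answers in retrieved documents.",
]

# 2. Vector database for semantic search.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
store = FAISS.from_texts(chunks, embeddings)

# 3. Retrieve context and build a grounded prompt.
question = "What is OLMo?"
context = "\n".join(d.page_content for d in store.similarity_search(question, k=2))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}\nAnswer:"

# 4. LLM integration (stand-in model here; the tutorial used OLMo).
generator = pipeline("text-generation", model="gpt2")
print(generator(prompt, max_new_tokens=64)[0]["generated_text"])
```

Assembling the prompt by hand rather than through a chain abstraction keeps each stage of the pipeline visible, which mirrors the step-by-step structure of the module.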
Technical Stack
- LLM: OLMo, a fully open-source language model (a loading sketch follows this list)
- Vector Database: Open-source vector storage and retrieval
- Python Libraries: LangChain, Hugging Face Transformers
- Infrastructure: JupyterHub for participant compute environments
- Deployment: Scalable cloud infrastructure for tutorial delivery
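As one illustration of the LLM piece of this stack, an OLMo checkpoint can be loaded directly through Hugging Face Transformers. The checkpoint ID below is one publicly released variant and is only an example of the kind of model used; a recent transformers release with OLMo support is assumed.

```python
# Sketch of loading an OLMo checkpoint with Hugging Face Transformers.
# "allenai/OLMo-1B-hf" is one released checkpoint; the tutorial may have
# used a different OLMo size or a dedicated inference server.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMo-1B-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Retrieval-Augmented Generation is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```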
Leadership Roles
As tutorial lead and lead instructor, I:
- Curriculum Design: Architected the overall learning experience and content flow
- Infrastructure Engineering: Set up the computing environment for 100+ participants
- Content Development: Created initial tutorial materials and code examples
- Instruction: Delivered Module 3 on RAG application development
- Technical Support: Ensured smooth execution during the live tutorial
Innovation & Impact
First Open-Source GenAI Tutorial
This tutorial broke new ground by using entirely open-source tools:
- No proprietary APIs (OpenAI, Anthropic, etc.)
- Fully reproducible stack
- No vendor lock-in
- Complete transparency in model behavior
Empowering Scientists
Enabled researchers to:
- Build domain-specific AI assistants
- Process and query large scientific literature corpora
- Maintain control over their data and models
- Understand AI capabilities and limitations
Infrastructure Challenges
Successfully deployed compute infrastructure to support the following (a configuration sketch appears after the list):
- Real-time LLM inference for multiple users
- Vector database operations at scale
- Interactive Jupyter environments
- Stable performance throughout the 3-hour tutorial
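The hypothetical jupyterhub_config.py fragment below sketches the kind of per-user limits involved in serving a large hands-on audience; the actual SciPy deployment (spawner, cloud provider, and exact limits) is not reproduced here.

```python
# jupyterhub_config.py sketch: generic settings for capping per-participant
# resources during a large live tutorial. Values are illustrative only.
c = get_config()  # noqa: F821  (injected by JupyterHub when loading config)

# Per-user resource caps (enforced by container-based spawners).
c.Spawner.mem_limit = "8G"
c.Spawner.cpu_limit = 2.0

# Throttle simultaneous server starts so 100+ logins at the start of the
# session do not overwhelm the cluster.
c.JupyterHub.concurrent_spawn_limit = 20
```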
Community Reception
The tutorial received excellent feedback from SciPy attendees, with particular appreciation for:
- Practical, hands-on approach
- Open-source commitment
- Scientific use case focus
- Comprehensive coverage of RAG pipeline
Future Directions
The tutorial materials serve as a foundation for:
- Ongoing GenAI education programs at eScience Institute
- Integration into hackweek curricula
- Best practices for scientific AI applications
- Community standards for responsible AI use in research