SciPy 2024 GenAI Tutorial

The Generative AI Copilot for Scientific Software tutorial taught scientists to build AI-powered Q&A applications with open-source tools.

Technologies

  • Python
  • OLMo
  • RAG
  • Vector Databases
  • JupyterHub
  • Open-source LLMs

Overview

Led the design and delivery of a tutorial at the SciPy 2024 conference in Tacoma, WA, teaching scientists how to create AI-powered question-answering applications using entirely open-source tools and models. The tutorial was the eScience Institute's first foray into Generative AI education with a fully open-source stack.

Tutorial Objectives

Enable scientists to:

  • Build Retrieval-Augmented Generation (RAG) applications for research
  • Use open-source Large Language Models (LLMs) like OLMo
  • Create AI copilots tailored to their scientific domains
  • Deploy and scale AI applications for research teams

Key Components

Module 1: Introduction to LLMs

Introduction to Large Language Models and their applications in scientific research.

Module 2: Vector Databases & Embeddings

Understanding how to represent and search scientific documents using vector embeddings.
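
To illustrate the idea behind this module, here is a minimal sketch of embedding-based search. It is not the tutorial's code: the embed function is a toy bag-of-words stand-in for a learned embedding model, and the names embed, cosine, search, and DOCS are hypothetical. A real vector database would store dense vectors and use an approximate index, but the ranking step follows the same shape.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding (assumption): sparse word counts instead of a learned model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, documents, top_k=2):
    """Return the top_k (score, document) pairs, most similar first."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical mini-corpus, for illustration only.
DOCS = [
    "sea ice extent declined in the arctic",
    "galaxy rotation curves suggest dark matter",
    "arctic sea ice thickness measured by satellite",
]

results = search("arctic sea ice", DOCS)
for score, doc in results:
    print(f"{score:.2f}  {doc}")
```

Swapping the toy embed function for a sentence-embedding model changes the vectors but leaves the search logic unchanged, which is why the two concerns are kept in separate functions.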

Module 3: RAG Application Development

Role: Lead Instructor
Building a complete Retrieval-Augmented Generation application that combines:

  • Document ingestion and preprocessing
  • Vector database for semantic search
  • LLM integration for natural language responses
  • Web interface for user interaction
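
The first three components above can be sketched as a single pipeline. This is a minimal illustration under stated assumptions, not the tutorial's implementation: retrieval uses a toy word-overlap score in place of a vector database, and generate is a placeholder where a call to an open-source LLM such as OLMo would go. All function names here are hypothetical, and the web interface is omitted.

```python
def chunk(text, size=40):
    """Preprocessing: split a document into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def score(query, passage):
    """Toy relevance (assumption): count of distinct query words in the passage."""
    return len(set(query.lower().split()) & set(passage.lower().split()))


def retrieve(query, chunks, top_k=2):
    """Stand-in for a vector-database lookup: keep the top_k best-scoring chunks."""
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:top_k]


def build_prompt(question, context_chunks):
    """Combine the retrieved context and the question into one prompt string."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\nAnswer:"
    )


def generate(prompt):
    """Placeholder (assumption): a real application would call an LLM here."""
    return f"[model response to a {len(prompt)}-character prompt]"


def answer(question, documents):
    """Run ingestion, retrieval, and generation; return the reply and its context."""
    chunks = [c for doc in documents for c in chunk(doc)]
    context = retrieve(question, chunks)
    return generate(build_prompt(question, context)), context


reply, context = answer(
    "what is happening to sea ice",
    ["sea ice extent is shrinking", "dark matter shapes galaxies"],
)
print(context[0])
print(reply)
```

Returning the retrieved context alongside the reply lets a user check which passages the answer was grounded in.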

Technical Stack

  • LLM: OLMo - fully open-source language model
  • Vector Database: Open-source vector storage and retrieval
  • Python Libraries: LangChain, Hugging Face Transformers
  • Infrastructure: JupyterHub for participant compute environments
  • Deployment: Scalable cloud infrastructure for tutorial delivery

Leadership Roles

As tutorial lead, I:

  • Designed the tutorial modules, shaping the overall learning experience and content flow
  • Engineered the infrastructure, setting up the computing environment for 100+ participants
  • Developed content, creating the initial tutorial materials and code examples
  • Taught Module 3 on RAG application development
  • Provided technical support to keep the live tutorial running smoothly

Innovation & Impact

First Open-Source GenAI Tutorial

As the eScience Institute's first GenAI tutorial built on a fully open-source stack, it offered:

  • No proprietary APIs (OpenAI, Anthropic, etc.)
  • Fully reproducible stack
  • No vendor lock-in
  • Complete transparency in model behavior

Empowering Scientists

Enabled researchers to:

  • Build domain-specific AI assistants
  • Process and query large scientific literature corpora
  • Maintain control over their data and models
  • Understand AI capabilities and limitations

Infrastructure Challenges

Deployed infrastructure to support:

  • Real-time LLM inference for multiple users
  • Vector database operations at scale
  • Interactive Jupyter environments
  • Stable performance throughout the 3-hour tutorial

Community Reception

The tutorial received excellent feedback from SciPy attendees, with particular appreciation for:

  • Practical, hands-on approach
  • Open-source commitment
  • Scientific use case focus
  • Comprehensive coverage of RAG pipeline

Future Directions

The tutorial materials serve as a foundation for:

  • Ongoing GenAI education programs at the eScience Institute
  • Integration into hackweek curricula
  • Best practices for scientific AI applications
  • Community standards for responsible AI use in research