Cambridge Healthtech Institute's Inaugural
Training Data Generation and Quality
Ensuring Quality Predictions via Attention to Data Quality
January 15-16, 2025, PST (Pacific Standard Time)
Wednesday, January 15
BENCHMARKING, BIAS, AND CONTROLS
Protein Language Models Are Biased by Unequal Sequence Sampling across the Tree of Life
Frances Ding, PhD, Machine Learning Scientist, Prescient Design, Genentech
Protein language models (PLMs), like all machine learning models, learn biases from data. In this talk I will show that PLMs unintentionally learn a strong species bias. Specifically, PLM likelihoods of protein sequences from certain species (e.g., human, E. coli) are systematically higher, independent of the protein in question. I trace the origins of this bias and demonstrate how it can be detrimental for some protein design applications, such as enhancing thermostability.
Improving Antibody Language Models with Native Pairing
Bryan Briney, PhD, Assistant Professor, Immunology & Microbial Science, Scripps Research Institute
We developed Baseline Antibody Language Models (BALM) using Jaffe's dataset of 1.6 million natively paired human antibody sequences. Training with paired sequences (BALM-paired) outperformed unpaired training, demonstrating learning of cross-chain immunological features. ESM-2, a protein language model, showed similar improvements when fine-tuned with paired data. This approach addresses limitations in current antibody models, enhancing our understanding of antibody structure-function relationships. We discuss implications for antibody engineering and therapeutic development.
Refreshment Break in the Exhibit Hall with Poster Viewing, 2:35 pm
AI Benchmarking Competition Based on High-Throughput Automation and Cloud Lab Experimentation
Peter Kelly, Director, Open Datasets Initiative, Align to Innovate
Align to Innovate, a non-profit research organization, is on a mission to shepherd biology into a data-first discipline through reproducible, scalable, and shareable experimentation. We run a suite of programs that work in conjunction to develop automated wet-lab experimental methods accessible to the community, collect large-scale public protein engineering datasets, and benchmark predictive and generative protein design algorithms. All our work is community-driven, collaborative, and operates under open science principles.
KEYNOTE PRESENTATION: Generation of High-Quality Aggregation Propensity Datasets for Machine Learning by Deep Mutational Scanning and an in vivo Assay
David J. Brockwell, PhD, Professor, School of Molecular and Cellular Biology, University of Leeds
A key requisite of any machine learning campaign is the availability of large volumes of high-quality training data that reports on the property to be predicted. Here we show that a tripartite beta-lactamase assay previously used by our group as a directed evolution screen can be reconfigured into a deep mutational screening format, providing datasets that can subsequently be used to train predictive models for different biophysical properties.
HIGH-THROUGHPUT EXPERIMENTATION
FEATURED PRESENTATION: High-Throughput Data Generation and Experimental Validation
Gabriel J. Rocklin, PhD, Assistant Professor, Pharmacology, Northwestern University
All proteins continuously fluctuate between different conformational states according to the energies of these states and the barriers between them. Even rare, high-energy states can have large impacts on protein function, aggregation, immunogenicity, and more. These high-energy states are challenging to observe and have never been examined at scale. Using a new high-throughput approach, we quantified protein energy landscapes for 5,000 domains and applied these data to guide protein engineering.
High-Throughput Screens to Validate Model Performance
Amir P. Shanehsazzadeh, Artificial Intelligence Scientist, Absci Corp.
Several in silico metrics have been proposed as a means of assessing antibody design strategies. While these metrics have been utilized to evaluate and benchmark models, there has been little in vitro validation to determine the validity of such metrics. We showcase experiments designed to assess whether or not ranking antibodies by these metrics increases binding rates or binding affinities.
Wednesday Night Meet-Up
What Needs to be Done to Make a Pipeline of Mini-Binders More Developable?
Monica L. Fernandez-Quintero, PhD, Staff Scientist, General Inorganic & Theoretical Chemistry, Scripps Research Institute
Network, Inspire Others and Connect
- Get to know fellow peers and colleagues
- Make connections and network with other institutions
- Inspire others and be inspired!
We will meet outside the exhibit hall, then move to the lounge area.
Thursday, January 16
Registration and Morning Coffee, 7:45 am
PLENARY KEYNOTE SESSION
Transforming Therapeutic Protein Engineering
Marissa Mock, PhD, Senior Research Director, Amgen Inc.
Generative biology is an emerging discipline that integrates artificial intelligence (AI) and machine learning (ML) with advanced life science technologies. The application of generative biology to protein engineering is accelerating the discovery and design of complex proteins with therapeutic potential, and maximizing the benefits of these novel technologies will require seamless integration of both wet- and dry-laboratory technologies.
Session Break, 8:50 am
DATASET TRAINING FOR SPECIFIC MODELS AND EXPERIMENTS
Training Data Requirements for Antibody-Antigen Binding Affinity Prediction under Multiple Circumstances
Alissa Hummer, PhD, Postdoctoral Researcher, Biochemistry, Stanford University
Antibodies are an important class of medicines, whose efficacy is driven by specific target binding. Given the therapeutic relevance, there have been multiple attempts to predict antibody-antigen binding affinity computationally. I will discuss our findings on how training data influences the success and selection of machine learning strategies to tackle this challenge, ranging from antigen-specific to generalizable and zero-shot affinity prediction.
Enhanced Prediction of Protein-Protein Interface Structure via Augmentation with in vitro Affinity Data
David Noble, Data Scientist II, A Alpha Bio Inc.
The structural complex of a protein-protein interaction (PPI) can yield important mechanistic insights that support drug discovery efforts. Rigid-body docking and predictive models such as AlphaFold-Multimer still perform poorly on difficult but clinically significant systems. Here we present AFInjection, a framework for generating experimental data and incorporating it into AlphaFold to improve complex prediction. AFInjection uses affinity data from in vitro directed coevolution of a PPI to find novel functional sequence pairs, which are incorporated into AlphaFold’s features to better infer the parental complex. We demonstrate the utility of this method on antibody-antigen systems and weak PPIs with disordered regions.
Leveraging Novel In Vivo Datasets to Generate Machine Learning Models Predicting Protein Aggregation and Developability
Conor McKay, Researcher, Astbury Centre for Structural Molecular Biology, University of Leeds
Protein aggregation impacts neurodegenerative diseases and biotherapeutic manufacturing. Big data tools like AlphaFold and protein language models excel in biology but rely heavily on high-quality training datasets. This work introduces the tripartite β-lactamase assay, a novel method for generating large, high-quality datasets that link protein aggregation to cell survival, enabling deeper insights into protein behavior and aggregation-related challenges.
Mimic Antibodies and How to Find Them
Brennan Abanades, PhD, Postdoctoral Fellow, Large Molecule Research, Roche
The majority of antibodies in the PDB that target the same binding site as a non-antibody protein are mimic antibodies: they share with the other protein a motif composed of key residues at conserved geometrical positions. By investigating mimic antibodies and how they imitate the binding site of other proteins, we develop a method for identifying them in repertoire data and validate it on IL-18RA.
Coffee Break in the Exhibit Hall with Poster Viewing, 10:30 am
HT Developability Analysis to Support Model Training
Bismark Amofah, PhD, Senior Scientist, Biologics Engineering, AstraZeneca
Classic developability assays are low to medium throughput and require complex reagent generation. Building ML tools requires large, normalized datasets, which means adapting these assays or replacing them with ones amenable to high-throughput (HT) automation. We describe our process and results for validating replacement HT developability assays and an HT developability package compatible with very early HT screening.
TABLE 6: Machine Learning in Biologic Drug Discovery: Leveraging External Data Sources
David Noble, Data Scientist II, A Alpha Bio Inc.
- Quantity: Availability challenges, scaling laws, synthetic data
- Quality: Diversity, leakage, reproducibility, quality vs. quantity
- Collaborative data generation: Industry-academia partnerships, data sharing consortia
- Federated learning: Technical challenges, open-source foundation models
- Intellectual property: Data ownership, balancing openness with commercial interests
- Open-source data: Curation quality, integrating diverse sources with proprietary data
Ice Cream & Cookie Break in the Exhibit Hall with Last Chance for Poster Viewing, 1:10 pm
DATASET GENERATION AND CURATION
Improving AlphaFold2 Performance with a Global Metagenomic & Biological Data Supply Chain
Geraldene Munsamy, PhD, Senior Scientist, Deep Learning, Basecamp Research Ltd.
Scaling laws estimate that over a trillion species exist, yet fewer than 0.00001% have been studied. Powered by a global metagenomic data supply chain, BaseFold offers more accurate protein structure prediction, achieving up to 80% reductions in RMSD values. This leads to more reliable predictions, better docking results, and advances in therapeutic development, all while incentivizing biodiversity protection.
Curation Strategies for R&D Pipeline Data
Kevin Metcalf, PhD, Associate Principal Scientist, Merck & Co., Inc.
Model-based prediction of biologics developability properties will increase speed to clinic. Previous pipeline program data is a valuable data source for training models but requires data curation, contextualization, annotation, and quality control for this new use. I will describe how we incorporated historical data using data quality control protocols to create reusable data products for machine learning prediction of key attributes, including hydrophobicity and polyspecificity of monoclonal antibodies.
ML Models for Nanobody Developability Trained on a Purpose-Built Multi-Readout Dataset
Samuel Demharter, PhD, Senior Data Scientist, Discovery Data Science and Protein Science & Technologies, Genmab
The biophysical characterisation of biologics requires significant wet-lab resources. To enable large-scale predictions of millions of molecules, protein-language models have become an attractive proposition to accurately predict lab readouts. However, current machine-learning models are limited in accuracy largely due to lack of high-quality and high-volume training data. In this talk, we present the generation of a maximally informative dataset for the purpose of training machine-learning models for nanobody developability predictions.
A Machine Learning-Driven Approach for Multi-Parametric Optimization of T Cell Engagers
Winston Haynes, PhD, Head, Data Science & Machine Learning, LabGenius Ltd.
T-cell engagers (TCEs) promise breakthroughs in the treatment of solid tumors, but their progression in the clinic is limited by on-target, off-tumor toxicity. In this talk, I describe how our platform integrates active learning, automation, and high-throughput functional assays to efficiently identify highly selective and potent TCEs. I highlight our utilization of the design-build-test-learn ecosystem to generate high-quality data that powers our machine learning models and therapeutic assets.
Close of BioLogic Summit, 4:05 pm
*The program is subject to change without prior notice due to unforeseen circumstances.