Cambridge Healthtech Institute's Inaugural

Training Data Generation and Quality

Ensuring Quality Predictions via Attention to Data Quality

January 15-16, 2025, PST (US Pacific Standard Time)

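The success of ML models in drug discovery depends heavily on the quality and relevance of their training data. CHI's Training Data Generation and Quality conference explores cutting-edge strategies for creating high-quality datasets that drive ML model performance. Attendees will learn about amplification strategies, methods for mitigating bias and noise, the importance of controls and validation, and how to ensure that a dataset is fit for a specific application. The conference also covers emerging approaches such as federated learning and the standardization of data generation practices. In addition, it presents the latest advances in closed-loop systems that integrate predictive modeling with high-throughput automated experimentation, enabling adaptive screening, active learning, and in silico benchmarking.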

Wednesday, January 15

1:30 pm

Chairperson’s Remarks

Arvind Sivasubramanian, PhD, Director, Computational Biology & Platform Technologies, Adimab LLC

1:35 pm

Protein Language Models Are Biased by Unequal Sequence Sampling across the Tree of Life

Frances Ding, PhD, Machine Learning Scientist, Prescient Design, Genentech

Protein language models (PLMs), like all machine learning models, learn biases from data. In this talk, I will show that PLMs unintentionally learn a strong species bias. Specifically, PLM likelihoods of protein sequences from certain species (e.g., human, E. coli) are systematically higher, independent of the protein in question. I trace the origins of this bias and demonstrate how it can be detrimental for some protein design applications, such as enhancing thermostability.

2:05 pm

Improving Antibody Language Models with Native Pairing

Bryan Briney, PhD, Assistant Professor, Immunology & Microbial Science, Scripps Research Institute

We developed Baseline Antibody Language Models (BALM) using Jaffe's dataset of 1.6 million natively paired human antibody sequences. Training with paired sequences (BALM-paired) outperformed unpaired training, demonstrating learning of cross-chain immunological features. ESM-2, a protein language model, showed similar improvements when fine-tuned with paired data. This approach addresses limitations in current antibody models, enhancing our understanding of antibody structure-function relationships. We discuss implications for antibody engineering and therapeutic development.

2:35 pm

Refreshment Break in the Exhibit Hall with Poster Viewing

3:10 pm

AI Benchmarking Competition Based on High-Throughput Automation and Cloud Lab Experimentation

Peter Kelly, Director, Open Datasets Initiative, Align to Innovate

Align to Innovate, a non-profit research organization, is on a mission to shepherd biology into a data-first discipline through reproducible, scalable, and sharable experimentation. We run a suite of programs that work in conjunction to develop automated wet-lab experimental methods accessible to the community, collect large-scale public protein engineering datasets, and benchmark predictive and generative protein design algorithms. All our work is community-driven, collaborative, and operates under open science principles.

3:40 pm

KEYNOTE PRESENTATION: Generation of High-Quality Aggregation Propensity Datasets for Machine Learning by Deep Mutational Scanning and an in vivo Assay

David J. Brockwell, PhD, Professor, School of Molecular and Cellular Biology, University of Leeds

A key requisite of any machine learning campaign is the availability of large volumes of high-quality training data that reports on the property to be predicted. Here we show that a tripartite beta-lactamase assay previously used by our group as a directed evolution screen can be reconfigured into a deep mutational screening format, providing datasets that can subsequently be used to train predictive models for different biophysical properties.

4:10 pm

FEATURED PRESENTATION: High-Throughput Data Generation and Experimental Validation

Gabriel J. Rocklin, PhD, Assistant Professor, Pharmacology, Northwestern University

All proteins continuously fluctuate between different conformational states according to the energies of these states and the barriers between them. Even rare, high-energy states can have large impacts on protein function, aggregation, immunogenicity, and more. These high-energy states are challenging to observe and have never been examined at scale. Using a new high-throughput approach, we quantified protein energy landscapes for 5,000 domains and applied these data to guide protein engineering.

4:40 pm

Design of Multifunctional Antibodies with Generative AI and High-Throughput Data Iteration

Wei Lu, PhD, Director, AI Drug Design, Aureka Biotechnologies, Inc.

5:10 pm

High-Throughput Screens to Validate Model Performance

Amir P. Shanehsazzadeh, Artificial Intelligence Scientist, Absci Corp.

Several in silico metrics have been proposed as a means of assessing antibody design strategies. While these metrics have been utilized to evaluate and benchmark models, there has been little in vitro validation to determine the validity of such metrics. We showcase experiments designed to assess whether or not ranking antibodies by these metrics increases binding rates or binding affinities.

5:40 pm

Close of Day

6:00 pm

What Needs to be Done to Make a Pipeline of Mini-Binders More Developable?

Monica L. Fernandez-Quintero, PhD, Staff Scientist, General Inorganic & Theoretical Chemistry, Scripps Research Institute

Network, Inspire Others and Connect

Get to know fellow peers and colleagues

Make connections and network with other institutions

Inspire others and be inspired!

We will meet outside of the exhibit hall then transition to the lounge area

Thursday, January 16

7:45 am

Registration and Morning Coffee

8:15 am

Chairperson's Remarks

M. Frank Erasmus, PhD, Head, Bioinformatics, Specifica, an IQVIA business

8:20 am

Transforming Therapeutic Protein Engineering

Marissa Mock, PhD, Senior Research Director, Amgen Inc.

Generative biology is an emerging discipline that integrates artificial intelligence (AI) and machine learning (ML) with advanced life science technologies. The application of generative biology to protein engineering is accelerating the discovery and design of complex proteins with therapeutic potential. Maximizing the benefits of these novel technologies will require seamless integration of wet- and dry-laboratory technologies.

8:50 am

Session Break

8:55 am

Chairperson’s Remarks

Bismark Amofah, PhD, Senior Scientist, Biologics Engineering, AstraZeneca

9:00 am

Training Data Requirements for Antibody-Antigen Binding Affinity Prediction under Multiple Circumstances

Alissa Hummer, PhD, Postdoctoral Researcher, Biochemistry, Stanford University

Antibodies are an important class of medicines, whose efficacy is driven by specific target binding. Given the therapeutic relevance, there have been multiple attempts to predict antibody-antigen binding affinity computationally. I will discuss our findings on how training data influences the success and selection of machine learning strategies to tackle this challenge, ranging from antigen-specific to generalizable and zero-shot affinity prediction.

9:30 am

Enhanced Prediction of Protein-Protein Interface Structure via Augmentation with in vitro Affinity Data

David Noble, Data Scientist II, A Alpha Bio Inc.

The structural complex of a protein-protein interaction (PPI) can yield important mechanistic insights that support drug discovery efforts. Rigid-body docking and predictive models such as AlphaFold-Multimer still produce poor-quality predictions for difficult but clinically significant systems. Here we present AFInjection, a framework for generating experimental data and incorporating it into AlphaFold to improve complex prediction. AFInjection uses affinity data from in vitro directed coevolution of a PPI to find novel functional sequence pairs, which are incorporated into AlphaFold’s features to better infer the parental complex. We demonstrate the utility of this method on antibody-antigen systems and weak PPIs with disordered regions.

10:00 am

Leveraging Novel In Vivo Datasets to Generate Machine Learning Models Predicting Protein Aggregation and Developability

Conor McKay, Researcher, Astbury Centre for Structural Molecular Biology, University of Leeds

Protein aggregation impacts neurodegenerative diseases and biotherapeutic manufacturing. Big data tools like AlphaFold and protein language models excel in biology but rely heavily on high-quality training datasets. This work introduces the tripartite β-lactamase assay, a novel method for generating large, high-quality datasets that link protein aggregation to cell survival, enabling deeper insights into protein behavior and aggregation-related challenges.

10:15 am

Mimic Antibodies and How to Find Them

Brennan Abanades, PhD, Postdoctoral Fellow, Large Molecule Research, Roche

The majority of antibodies in the PDB that target the same binding site as a non-antibody protein are mimic antibodies: they share with that protein a motif composed of key residues at conserved geometrical positions. By investigating mimic antibodies and how they imitate the binding site of other proteins, we develop a method for identifying them in repertoire data and validate it on IL-18RA.

10:30 am

Coffee Break in the Exhibit Hall with Poster Viewing

11:00 am

HT Developability Analysis to Support Model Training

Bismark Amofah, PhD, Senior Scientist, Biologics Engineering, AstraZeneca

Classic developability assays are low to medium throughput and require complex reagent generation. Building ML tools demands large, normalized datasets, which means adapting or replacing these assays with ones amenable to high-throughput (HT) automation. We describe our process and results for validating replacement HT developability assays and an HT developability package compatible with very early HT screening.

11:30 am

Interactive Breakout Discussions

TABLE 5: Internal Data Generation and Curation

Kevin Metcalf, PhD, Associate Principal Scientist, Merck & Co., Inc.

Amplification strategies
Avoiding bias
Closed-loop experimentation
Controls and validation
Dealing with skewed data
Historical data

TABLE 6: Machine Learning in Biologic Drug Discovery: Leveraging External Data Sources

David Noble, Data Scientist II, A Alpha Bio Inc.

Quantity: Availability challenges, scaling laws, synthetic data
Quality: Diversity, leakage, reproducibility, quality vs. quantity
Collaborative data generation: Industry-academia partnerships, data sharing consortia
Federated learning: Technical challenges, open-source foundation models
Intellectual property: Data ownership, balancing openness with commercial interests
Open-source data: Curation quality, integrating diverse sources with proprietary data

12:30 pm

Session Break

12:40 pm

LUNCHEON PRESENTATION: State Diagram Embeddings to Ground Protein Models in Physical Reality: Single Shot Biophysical Classification of Antibodies

Shamit Shrivastava, CoFounder & CEO, Apoha

1:10 pm

Ice Cream & Cookie Break in the Exhibit Hall with Last Chance for Poster Viewing

2:00 pm

Chairperson’s Remarks

Geraldene Munsamy, PhD, Senior Scientist, Deep Learning, Basecamp Research Ltd.

2:05 pm

Improving AlphaFold2 Performance with a Global Metagenomic & Biological Data Supply Chain

Geraldene Munsamy, PhD, Senior Scientist, Deep Learning, Basecamp Research Ltd.

Scaling laws estimate that over a trillion species exist, yet fewer than 0.00001% have been studied. Powered by a global metagenomic data supply chain, BaseFold offers more accurate protein structure prediction, achieving up to 80% reductions in RMSD values. This leads to more reliable predictions, better docking results, and advances in therapeutic development, all while incentivizing biodiversity protection.

2:35 pm

Curation Strategies for R&D Pipeline Data

Kevin Metcalf, PhD, Associate Principal Scientist, Merck & Co., Inc.

Model-based prediction of biologics developability properties will increase speed to clinic. Previous pipeline program data is a valuable data source for training models but requires data curation, contextualization, annotation, and quality control for this new use. I will describe how we incorporated historical data using data quality control protocols to create reusable data products for machine learning prediction of key attributes, including hydrophobicity and polyspecificity of monoclonal antibodies.

3:05 pm

ML Models for Nanobody Developability Trained on a Purpose-Built Multi-Readout Dataset

Samuel Demharter, PhD, Senior Data Scientist, Discovery Data Science and Protein Science & Technologies, Genmab

The biophysical characterisation of biologics requires significant wet-lab resources. To enable large-scale predictions of millions of molecules, protein-language models have become an attractive proposition to accurately predict lab readouts. However, current machine-learning models are limited in accuracy largely due to lack of high-quality and high-volume training data. In this talk, we present the generation of a maximally informative dataset for the purpose of training machine-learning models for nanobody developability predictions.

3:35 pm

A Machine Learning-Driven Approach for Multi-Parametric Optimization of T Cell Engagers

Winston Haynes, PhD, Head, Data Science & Machine Learning, LabGenius Ltd.

T-cell engagers (TCEs) promise breakthroughs in the treatment of solid tumors, but their progression in the clinic is limited by on-target, off-tumor toxicity. In this talk, I describe how our platform integrates active learning, automation, and high-throughput functional assays to efficiently identify highly selective and potent TCEs. I highlight our utilization of the design-build-test-learn ecosystem to generate high-quality data that powers our machine learning models and therapeutic assets.

4:05 pm

Close of BioLogic Summit