FYI - For Your Innovation: Bettering Human Health Through Artificial Intelligence with Sean McClain and Joshua Meier of Absci
ARK Invest 5/18/23 - Episode Page - 58m - PDF Transcript
Welcome to FYI, the For Your Innovation podcast.
This show offers an intellectual discussion on technologically-enabled disruption because
investing in innovation starts with understanding it.
To learn more, visit ark-invest.com.
ARK Invest is a registered investment advisor focused on investing in disruptive innovation.
This podcast is for informational purposes only and should not be relied upon as a basis
for investment decisions.
It does not constitute either explicitly or implicitly any provision of services or products by ARK.
All statements made regarding companies or securities are strictly beliefs and points of view held by ARK or podcast guests and are not endorsements or recommendations by ARK to buy, sell, or hold any security.
Clients of ARK Investment Management may maintain positions in the securities discussed in this podcast.
My name is Simon, ARK's Director of Life Sciences Research, and today we'll be discussing Absci, a public company harnessing generative AI to create more effective medicines faster and less expensively.
I'm joined by Sean McClain, Absci's founder and CEO, as well as Joshua Meier, Absci's Chief AI Officer.
Thanks for taking the time, guys.
Yeah, absolutely.
Thanks so much for having us here, Simon.
So, you know, our audience hasn't had the luxury, like me, of getting to know you guys
beforehand, so before we dive into the company, Sean and Josh, would you mind briefly talking
to us a little bit about yourselves and also why you're so passionate about the role of
AI in drug development?
Yeah, absolutely.
So I founded the company 12 years ago, actually, in a basement lab.
And the original idea was not applying generative AI to biologic drug discovery.
It was actually to engineer E. coli to produce antibodies.
And we're the first company to be able to produce an antibody in E. coli.
And what that enabled you to do was what's called a pooled approach.
You could basically take a billion-member antibody library, put it into a test tube of E. coli, and now you have a billion antibodies produced.
And this gave us a huge data advantage, being able to essentially produce these and screen
these at very, very high throughput.
And it ended up becoming a realization to me about three years ago that, wow, this is
what is needed for generative AI to really unlock biologic drug discovery or what we
like to call drug creation.
It's essentially going from this paradigm of drug discovery, where you're looking for
a needle in the haystack, to drug creation, where you're actually creating the needle.
And in our case, it's the biologic.
Being able to get the biologic or the antibody with all the attributes you want, the first
go around.
Mm-hmm.
So I think a lot of people are probably familiar with the terms antibody and antigen and understand
that antibodies are a really critical component of the body's immune system and help us to
attack and defend against disease.
But for those who may not have a very intimate understanding of what the drug discovery and
development processes look like, specifically for antibodies, which is what you guys are
focused on, would you mind briefly zooming in and discussing the importance of the things
you talked about, like with yeast and generating these libraries?
I think that would help people understand a little bit more context.
Yeah, no, absolutely.
So unlike a small molecule or a pill in the bottle where you have a chemist making it,
you have to have a living organism make an antibody.
And that's for the production.
But then to discover an antibody, you then have to use usually a mouse or what's called
phage display or yeast display.
I'll focus in on the mouse where Regeneron was really the pioneer in creating a humanized
mouse where you essentially could take, let's say, a cancer target or antigen of interest.
You inject it into the mouse.
The mouse uses its immune system to then create antibodies towards that target.
And then you extract out the blood and you're able to then find antibodies to a given target.
But the issue with that is that you can't tell a mouse to generate an antibody that
hits the specific location of the target that you want.
That has the affinity or how tightly it binds to a target or the developability or manufacturability.
You have no control over how a mouse generates an antibody.
And that's what really leads to these long lead times to get into the clinic, as well as low success rates in the clinic of less than 4%.
And again, what we're doing is completely changing that paradigm and using AI to design
the attributes you want the first go around and actually have control over the biology
for the first time.
So before I kick it over to Josh, now that it seems like we've fixed the Wi-Fi, I do
want to double underline something you said there, Sean, because I think it's going to
come up time and time again as we talk about this is when you're trying to develop an antibody
against a particular antigen or disease target, regardless of the disease, the idea is that
you're exploiting the specific binding reaction and you use the word affinity between those
two things, almost like a lock and a key.
And I like that you walked through the first vestiges of this approach with using animal
models and immunization, which has its pros and cons.
But moving further downstream, and we'll talk about the in vitro, like the yeast display
and E. coli and the bacterial display technology.
And then the ultimate golden goose of how much of this can we just do on a computer?
And so we'll get into that.
But before we do, Josh, I'm going to flip it over to you just for a brief intro on yourself,
how you got involved with Absci and why you're so passionate about the role of AI in antibody
discovery and development.
Yeah, absolutely.
Thanks for having us, Simon.
So a bit about my background.
I've been working at this intersection of AI and biology really before this thing became
cool.
It sounds like everyone these days has kind of established that this is going to take
over the industry.
But when I first started working on this, everyone thought it was crazy.
Like, why are you betting your career on AI for biology?
Just go work on traditional AI at least.
Even that wasn't as big then.
I started training my first neural networks back in 2013.
This is when we first got deep learning, really working on GPUs and the space started
to explode.
But I come from a family of doctors.
I was always excited to kind of deploy this technology to just better human health.
So I was always trying to find an intersection, but just the technology wasn't there yet.
AI wasn't good enough.
The data wasn't good enough.
And the problem space wasn't really worked out yet.
But fast forward a couple of years later, I was working at OpenAI.
And this was around the time of GPT-1.
So we were first seeing the first signs of life that language models could learn some
really interesting things.
And of course, within OpenAI, we were kind of like the believers of this stuff.
We felt that everything we're seeing today was going to happen eventually.
But the application I was really excited about then was like, well, if we can get this thing
to translate between languages or generate new text, could you use this to generate new
biological text?
Can you just output DNA sequences and protein sequences?
And I felt that'd be even bigger than just the NLP stuff.
Because if you just think about NLP, you and I can just write stuff on a computer, but we can't sit down and start typing out DNA sequences.
You're not going to get anything interesting.
So I left OpenAI shortly after that GPT-1 project to go join Meta and help start up an AI for
science initiative there.
And that's where we did this first work on language models on protein sequences, published
some of the first papers in this area demonstrating that these models could learn some really
interesting things.
And a couple of years into that, really the thing that was clear it was missing is that
when you're working in a Facebook or a Meta, you have really strong conviction in AI, which
allows you to go into these areas like AI for science and AI for biology.
But the thing that you're then missing is that wet lab component.
So how do you actually validate these designs?
How do you build differentiated training data?
And it became clear that this stuff was going to work for biology, and the question was who was actually going to reap the value.
I felt that at some point it wasn't going to be Meta, as much as it was a fantastic place to do this kind of research, but rather that aligning with a very forward-looking biotech company that was also going through this generative AI transformation made a lot of sense.
So to that point, I kind of connected with Sean and these visions really started to align.
And I joined Absci.
It's been pretty amazing to see the kind of research and science that can happen in such
a short time frame when you really bring these two differentiated edges together.
Yeah.
And it's actually a really funny story how Joshua and I met.
The team had put together a list of probably the top 50 AI researchers in the space and
gave it to me.
And I went on LinkedIn and I wrote emails to all of them.
And Joshua responded.
And yeah, he and I totally clicked on the vision and what we could do together.
And I would say the rest is history.
Absolutely.
Well, I wanted to maybe segue back to the main conversation around this point that you're
making Josh around the data and what you're able to do at a big tech company versus a
company that's trying to combine multiple disciplines, which honestly feels like the
field of life sciences is like inexorably headed is like a complete breakdown between
the walls of AI and biology.
And if you look at some of the large language models that are being used with human language,
you can train them on enormous amounts of information, ostensibly the whole internet
and all the texts they're in.
But the issue with life sciences is like a lot of the data generation techniques have
been artisanal, poor quality control, databases are fragmented and poorly annotated, and sure
we're improving along all these dimensions, especially with sequencing and the breakout
explosion and cost decline of DNA sequencing and other kind of ancillary technologies.
But I wanted to focus the conversation on data for a moment and talk about it in the
context of Absci.
Sean, you mentioned billions of data points with the E. coli expression.
And Josh, you know, you and I have had this previous conversation around data being the
rate limiting step in training these models.
So I wanted to discuss both of those points together and really focus the conversation
back on Absci and what you've done between in vitro and in silico work.
Yeah, Sean, why don't you kick it off as the one who really invented like the core microbial
platform we're using here?
Absolutely.
Yeah, as I had previously talked about, we were the first to engineer a very simple organism, E. coli, to produce an antibody.
And I guess stepping actually back for just one moment, like, how are antibodies currently
produced?
They're produced in mammalian cells, or CHO cells, and you can really only scale that up to maybe producing thousands or tens of thousands of antibodies in a given week.
And that's just not enough throughput or data to actually start training models.
And so by being in a microbial organism and engineering E. coli to produce antibodies, what you can do is what's called a pooled approach.
You can basically take a test tube of your engineered E. coli, take a billion-member DNA antibody library that encodes a billion different antibodies, throw that into your test tube, and now you have every single E. coli making a different antibody.
So in that single test tube, I now have a billion antibodies that have been produced.
So I've gone from thousands or tens of thousands of antibodies to billions.
Now the second question you then have to ask or solve is the functionality.
Now that we've produced them, what is the functionality or potential efficacy of these?
And this is where we develop this really breakthrough assay that we call our ACE assay, where we're
able to interrogate every single E. coli and look at the binding affinity or how tightly
it binds to a target of interest.
And so in that experiment, we can then be able to look at billions of protein-protein
interaction data points.
And when you're developing an antibody, there's really two important aspects outside of the
developability and manufacturability.
It's where does the antibody bind the location or what we refer to as the epitope?
And then how tightly is it binding?
Does it have high affinity or low affinity?
And these are the two attributes we're really able to hone in on very rapidly and train
our AI models.
And I'll let Joshua talk about this.
But this data has allowed us to really build these extraordinary models or train these
models that have allowed us to actually have a huge breakthrough in the industry, where
we were the first to use generative AI to design an antibody from scratch on a computer.
And actually, I think this is a really great time to hand it over to Joshua on what that
is and how we take this data, train our models to ultimately kind of see our big vision through
of being able to design a biologic at a click of a button.
Sure.
Thanks, Sean.
So if we look at how we're using data at Absci and even just taking a step back,
first of all, and the importance of data in this space, I think the whole language modeling
world is waking up to this today.
You look at something like ChatGPT and GPT-4.
One of the big things that's advertised there is something called reinforcement learning
from human feedback.
And the thing really to think about there is the human feedback part, where you can go
and scrape basically infinite amounts of data from the internet, although some would say
it's not really infinite.
We're almost running out of tokens now to feed these models as we're continuing to scale.
But if you just think about that human feedback, it's really critical to give these models
this very chat-like capability.
When you're training it on, people actually interfacing with the model and teaching it
in a very direct manner.
And the data is even more impactful than just finding random data on the internet because
the model is involved in that data collection process.
So in that view, this is something that we've had at Absci for a couple of years now.
And we've really built up the experimental platform and AI integration accordingly.
So specifically what that means is that almost all the data we're collecting in the lab is
actually AI-designed.
So if we go in the lab and we're going to go collect 100,000 or a million data points,
a million unique sequences, these aren't just random sequences that you find on the
internet or find in a mouse immunization campaign.
These are sequences that the AI model is designing.
Or rather, the AI scientist is designing.
Sometimes there's different benchmarks or baselines that you want to throw in.
There are different controls.
But at a high level, it's the machine learning model is actually creating that data, telling
us what data it wants us to collect.
And it's very similar then to this reinforcement learning from human feedback idea.
I mean, on a technical level, it's not exactly the same, but at a high level, it's the same
sort of finding that if you allow the model to help you generate the data, you actually
end up with a really nice flywheel there, where then you get that data back, the model
becomes smarter, and then you can do this again.
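The flywheel Joshua describes, where the model proposes the next batch of sequences, the lab measures them, and the retrained model proposes again, can be sketched in miniature. Everything below is a toy illustration, not Absci's actual platform: the hidden assay function, the per-residue scoring model, and the sequence lengths are all invented stand-ins for the real wet-lab loop.

```python
import random

random.seed(0)

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical ground truth: a hidden scoring function standing in for the
# wet-lab assay (binding signal of a 10-residue peptide).
HIDDEN_WEIGHTS = {aa: random.gauss(0, 1) for aa in AMINO_ACIDS}

def assay(seq):
    """Simulated wet-lab measurement with a little noise."""
    return sum(HIDDEN_WEIGHTS[aa] for aa in seq) + random.gauss(0, 0.1)

def fit(data):
    """Crude model: average per-residue contribution across observed labels."""
    totals = {aa: 0.0 for aa in AMINO_ACIDS}
    counts = {aa: 1e-9 for aa in AMINO_ACIDS}
    for seq, y in data:
        for aa in seq:
            totals[aa] += y / len(seq)
            counts[aa] += 1
    return {aa: totals[aa] / counts[aa] for aa in AMINO_ACIDS}

def propose(model, n):
    """Model-designed batch: sample sequences biased toward high-scoring residues."""
    top = sorted(AMINO_ACIDS, key=lambda aa: -model[aa])[:5]
    pool = top + list(AMINO_ACIDS)[:3]  # mostly exploit, keep a little exploration
    return ["".join(random.choice(pool) for _ in range(10)) for _ in range(n)]

# Round 0: a random library, analogous to an immunization campaign.
data = [(s, assay(s)) for s in
        ("".join(random.choice(AMINO_ACIDS) for _ in range(10)) for _ in range(200))]

for rnd in range(3):
    model = fit(data)                       # retrain on everything collected so far
    batch = propose(model, 200)             # the model tells us what to measure next
    data += [(s, assay(s)) for s in batch]  # the "lab" returns labels; flywheel turns
    print(f"round {rnd}: best measured score {max(y for _, y in data):.2f}")
```

The key design point the sketch tries to capture is that later rounds are measured on model-chosen sequences, so the dataset is enriched where the model is uncertain or optimistic, rather than being a fixed random sample.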
So that's really allowed us, I think, to scale the model really quickly.
It's also allowed us to just run a massive number of experiments and see what works.
Machine learning is a very empirical field.
A lot of people used to refer to deep learning as black magic, where you would just have
AI scientists who just have some intuition about how the models work.
And that's still, in a sense, how you come up with the next generation of these architectures,
like the transformer that people are scaling these days in NLP.
Like, how did people come up with that?
I mean, they can kind of give you reasons about it.
But at the end of the day, there's just really strong intuition that goes into this that's
built up through just a lot of time spent training these models and evaluating them.
And that's where this experimental feedback loop is also critical.
So the AI scientists can be doing dozens of experiments in a month, take different kinds
of models that they're training, test all of them in the lab, and really start to develop
insight and intuition about what's working and what isn't.
And I would also credit that as one of the fuels of many of our recent successes.
If I distill it down to a couple of key points here, it's like, you know, you have to solve
for a few things to make this type of project work.
The first is, you know, you're generating a ton of data.
You've created an in vitro technique to generate a ton, you know, billions of data points.
The second half of it is you have to actually create, you know, features and labels and get
that functional data out, you know, for every part of that library.
So you've done that with the ACE assay.
And then you're describing, and I actually wanted to dig into a nuanced point here,
because I'm not sure if it comes across every time, which is that you're getting data on
every different, you know, combination or perturbation that you're having in these libraries,
not just the top decile, you know, high affinity binders, like in other types of display
technologies, you know, you're basically you're physically like washing away all the things
that don't stick because they don't stick.
And you're, you know, maybe this is the wrong way to think about it, but you're kind of
biasing that data set towards only the things that work.
And if you're talking about reinforcement learning and like, you know, penalties and
loss, I want to get into that as well to try to understand the importance of actually
collecting the whole thing, not just, you know, the fraction that works.
Yeah, that's that's absolutely right.
So we've actually developed a number of variations to our core, what we call ACE assay,
based on that microbial system that that Sean introduced before.
And we can run the assay in a number of ways.
So one way we can run it is similar to past techniques in a binary way, where we're just
looking at whether a sequence is binding or not binding.
And what's really nice about that is there are some cases where you just really want to profile
hit rates in a very accurate way.
We find that when we run the assay that way, the precision and recall of that assay compared to lower-throughput but gold-standard assays is over 95%.
So that means that the information we're getting out there is highly reliable for us to
compare the hit rates of different models to each other.
But then we've also developed an alternative way of screening, which actually gives us
quantitative information.
So this is something that's very difficult to get with a traditional phage display or
a yeast display, at least to do it in a batched way.
What this means is we can go screen hundreds of thousands of sequences, and then we can
get a quantitative label for each of those sequences, a score.
And that score correlates very well with gold standard measurements of affinity.
We're seeing Pearson and Spearman correlations above 0.8 in most cases.
So taken together, these are really the two tools that you want for antibody engineering.
First, you want to identify a binder and you want to be able to profile like what fraction
of your sequences are actually binders.
And then among those binders, you want to be able to profile the affinity in a highly quantitative way, in an also very robust and accurate manner.
There's usually a tradeoff in biology between throughput and accuracy.
And I think we're at a very sweet spot with the assays that we've tuned over time to be
able to get very meaningful information for the models.
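The two validation metrics mentioned here, precision/recall for the binary binder calls and Pearson/Spearman correlation for the quantitative scores, are standard and easy to state in code. The numbers below are purely illustrative, and these are minimal reference implementations (the Spearman version does not handle tied ranks):

```python
import math

def precision_recall(pred, truth):
    """Binary assay calls vs. a gold-standard label set."""
    tp = sum(p and t for p, t in zip(pred, truth))
    fp = sum(p and not t for p, t in zip(pred, truth))
    fn = sum(not p and t for p, t in zip(pred, truth))
    return tp / (tp + fp), tp / (tp + fn)

def pearson(x, y):
    """Linear correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Rank correlation: Pearson on the ranks (no tie handling)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    return pearson(ranks(x), ranks(y))

# Hypothetical numbers: high-throughput screen scores vs. gold-standard affinities.
screen_scores = [0.9, 0.1, 0.8, 0.3, 0.7, 0.2]
gold_affinity = [8.2, 5.1, 7.9, 5.8, 7.1, 5.3]
binary_calls  = [True, False, True, False, True, True]
gold_binders  = [True, False, True, False, True, False]

p, r = precision_recall(binary_calls, gold_binders)
print(f"precision={p:.2f} recall={r:.2f}")
print(f"pearson={pearson(screen_scores, gold_affinity):.2f}",
      f"spearman={spearman(screen_scores, gold_affinity):.2f}")
```

Spearman is the natural choice when the screen's score is monotonically but not linearly related to true affinity, which is typical for enrichment-style assays.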
Yeah.
And I think there's a really important point that Joshua hit on.
It's not just using this wet lab technology to get data to train the models, but it's the
validation as well.
There's a lot of manuscripts that come out that don't show wet lab validation, but every
single model that we design and we train on, we then go into the wet lab and validate it.
And we can validate roughly three million AI-generated designs in a given week.
And so training on billions and then validating on millions and then being able to have a cycle
time, which you can go through all that in a six-week time period, just allows you to make
very, very rapid progress in a way in biology that really hasn't been done before.
Before I get into some of the manuscripts, which I'm eager to talk about, I did want to ask
just this general question.
I'm going to lean a little bit on a blog that a friend of mine, Pablo Lubroth, wrote, I think
last year about KPIs in AI-enabled drug discovery.
You know, and it pulled from a lot of analogies from the SaaS industry, right?
Like all these investors that are looking at SaaS companies, there's like a lingua franca of,
okay, we'll use terms like ARR and we have like very rigid definitions of like what everyone's
managing to, but on both the investment side and the entrepreneur and the founder side.
And as the space matures, you know, as an investor, and I'm sure a lot of people are in the same
position, like you get fatigued hearing about every possible mashup of AI and drug discovery
because there are a lot of them, right?
And so I appreciate the point you're making, Sean, about the importance of wet lab, you
know, gold standard validation is like a key part of that.
But I wanted to ask, like, what are some red flags and some green flags to you when you're
thinking about AI and drug discovery come together?
What are the things where you're like, oh, this is, you know, legitimately differentiated
or there's some value here.
And also to the extent that you're able to talk about it, I'd love to know what are the KPIs
that matter to you as you're tracking your own progress against, you know, this ability to
generate fully in silico antibodies without clear, you know, commercialization events or
contracts, like what are those metrics?
Yeah, absolutely.
I would say that there's like three key metrics that we look at.
Well, you know, first is, you know, being able to specifically hit an epitope that you want.
So again, an area of the target that's of interest and being able to hit that and not
having, you know, any sort of poly specificity towards anything else.
The second is then being able to have the accuracy of the model be good enough to predict the exact binding affinity that you want.
You know, let's say in this particular instance, to get the biology you wanted, you wanted a medium binder; you can have the model generate that for you.
The third aspect is, you know, I totally lost my train of thought on the third aspect, but I
will hand it over to Joshua.
If you agree on kind of those, at least those two kind of major areas of, you know, epitope
specificity, as well as being able to, you know, just predict the affinity.
And actually the third thing that I was actually going to mention was the accuracy of the model.
So can we hit the epitope 25% of the time, you know, 25% of what is coming out of the model is
hitting that epitope, you know, and, you know, ultimately increasing that over time.
So being able to get up to greater than a 90% accuracy is really where we want to be.
Yeah.
So those are exactly some of the problems that we're focused on at Absci right now.
I think one of the exciting things about this space is that it's moving very quickly and
those KPIs will definitely continue to update over time as well.
Right.
So it's not like when you're talking in a business setting, right?
There's like ARR, a very clear metric that you want to go after.
What is your revenue for a SaaS business?
I think in AI drug discovery, you have to be more creative than that.
You have to think through what are sort of the unmet needs that your application can
bring in and just be laser focused on solving for those.
And then once you solve those, then you kind of move on to the next problems afterwards.
So like Sean mentioned, some of the things that you just can't really do with existing assays are things like epitope specificity.
It's something that makes a lot of sense for an AI model because you can think of it as
like prompting.
You can prompt the model, much like you would chat with ChatGPT, with a specific epitope that you want to hit, or potentially other properties that you want your molecules to fit.
Another thing is about accuracy as well.
So you want a model that's very calibrated.
One of the things that we see with our models is that it's a phenomenon that we're calling
hit rate decay.
As you screen more and more sequences from the models, the accuracy starts to go down.
You know, we're screening hundreds of thousands of sequences here.
And we're like, wait, you know, at some point, doesn't the model run out of binders?
You know, where, where are all the binders?
You know, is it in the first couple of the model is pulling out, or is it all the way
through the end?
Do you really need to screen 100,000 in order to discover a binder?
And turns out the answer today is no, like we don't have to do that anymore because the
model is actually giving us sequences from the first, let's say, 10 or first 100 that
you're pulling out of the model.
So that's, for example, another KPI that we look for.
And we've put a lot of thought into the kinds of metrics and statistics that we've developed
to sort of evaluate our models in that way.
So this is another way that we're able to really benchmark our models against each other.
And therefore, you know, focus our efforts on making progress here.
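The "hit rate decay" KPI Joshua describes can be made concrete with a small sketch: screen sequences in the model's ranked order and track the cumulative fraction that turn out to be binders. The binder labels below are hypothetical stand-ins for assay results, not real data.

```python
def cumulative_hit_rate(is_binder):
    """is_binder: assay outcomes (1 = binder) in the model's ranked order, best first.
    Returns the cumulative hit rate after screening the top 1, 2, ... sequences."""
    hits, curve = 0, []
    for i, b in enumerate(is_binder, start=1):
        hits += b
        curve.append(hits / i)
    return curve

# A well-calibrated model front-loads its binders ...
calibrated = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
# ... while a poorly calibrated one scatters them throughout the ranking.
scattered  = [0, 1, 0, 0, 1, 0, 1, 0, 1, 1]

for name, labels in [("calibrated", calibrated), ("scattered", scattered)]:
    print(name, [round(v, 2) for v in cumulative_hit_rate(labels)])
```

Both toy models find the same five binders in total, but the calibrated one delivers them in its first few picks, which is exactly the property that lets you screen tens of sequences instead of 100,000.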
And the other thing I'll mention too is that we're not just developing, you know, AI within
biologic drug discovery for, for the sake of it, like we're really wanting to utilize
this to be able to discover, you know, new biologies.
You know, one of the things that we're really excited about is actually utilizing this sort
of technology to generate antibodies towards GPCRs, which are notoriously hard to drug
with standard immunization or phage display approaches and being able to specifically
target, you know, the epitope on a GPCR.
This really starts to unlock new biology and new targets, which is ultimately going to be,
you know, best for, for patients.
And, you know, it, you know, becomes highly differentiated, which we're, we're really
excited about.
And I think that's what you're going to start to see more and more is, is generative AI
unlocking new, new biology in a faster way than we've ever seen before.
And so earlier, Sean, too, you mentioned this concept of humanization.
When you were talking about, I think a transgenic mouse model, I wanted to blow up
this point about humanization as well.
Because we've talked a lot about affinity.
And of course, there are multiple things that can go wrong, you know, even if you have
an excellent binder.
And I wanted to talk about this a little bit because it's the subject of one of your
papers, too, on this concept of naturalness, which is like, forgive me if I mischaracterize
it, but it's sort of like an ensemble of a lot of different, you know, key aspects of
what it means to have an antibody that is developable, meaning it can be manufactured.
It's hopefully rid of downstream liabilities, you know, a human body is able to take it
in without any sort of, you know, unwanted immunogenic or, you know, side reaction.
So I wanted to learn about this like multi layer Swiss cheese model of optimization,
you know, past just affinity.
And maybe a little bit more of a technical add on to that question is, are these
optimizations happening in parallel, or is it more sequential?
Right? Does that make sense?
Like, you know, you're starting and kind of going downstream.
So I'd love to know more about that.
Yeah, no, absolutely.
So if you just look at an antibody sequence, like there's more sequence variants or drugs
you could design than there are atoms in the universe.
And so the search space is ginormous.
And, you know, you look at like evolution, like a mouse or humans evolved to have a
particular immune system, and they're going to have a particular immune repertoire where
they design certain, you know, antibodies that, you know, don't have an immunogenic response.
And essentially, you know, that's what humanization is, is making sure that the antibody
you design isn't going to be targeted by the immune system and have what's called an anti
drug antibody.
And when you look at, again, the overall possibilities and our ability now to search,
to start to search that space, you start to become concerned with, you know,
immunogenicity: like, is the model designing these kinds of sequences that have the same functionality, but are very different from, you know, what looks like a normal antibody?
So that's when we started to build out this naturalness model, which I'll have Joshua kind
of talk about that.
But what we showed with this model is that it's inversely correlated to immunogenicity and has
pretty good correlation to developability and manufacturability.
And so you're ensuring not only are you getting, you know, the functionality out of the antibody
you want, but you're also ensuring that you have low immunogenicity or it's as human like as
possible when going into the clinic, because there are clinical trials that do fail due to
high immunogenicity.
And so being able to control for both of these is a really important aspect.
And I'll let Joshua kind of dive into, you know, how we went about, you know, doing this and then,
and then how are we, you know, doing this in terms of are we doing this in parallel at the same
time, or is it a sequential?
Yeah, so when we think about the naturalness model, the key insight that we were trying to build
towards is how can we take sort of this universe of antibodies that we know about and then use that
information to distill the key factors that make an antibody so-called natural.
And that's really what the naturalness score is.
So the way that we train the model is we took hundreds of millions of antibody sequences that you find
within humans, that you find within animals, really naturally occurring.
And then we asked the model to give us some score.
So given a new sequence, what is essentially the likelihood that you would see that sequence within one
of these natural immune repertoires?
And it turns out, and, you know, it's kind of intuitive that this is the case.
But if the model thinks a sequence is more likely to have been found in immune repertoires, then it's more likely to have all of these properties that Sean was talking about before, like the developability properties, having low immunogenicity.
And the reason why we built the model this way is it turns out that there is significantly more
immune repertoire data that you can get access to than there is even some of the developability
and immunogenicity data that's out there.
So on a fundamental level, you can think of the model as like bootstrapping from all the
information that's available here.
And it's a very similar insight to what you observe with, you know, something like GPT-3 or ChatGPT, right, where you pre-train the model on lots of information.
And then the model is sort of learning, like, what are the real semantic rules that go into
language? We're doing the same thing here now for what are the rules that go into an antibody.
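As a rough illustration of the idea Joshua describes, a "naturalness" score can be pictured as the likelihood of a sequence under a model fit to natural repertoires. The sketch below is a toy position-frequency version in Python; the mini-repertoire and test sequences are invented, and Absci's actual model is a deep language model trained on hundreds of millions of sequences, not this simple statistic.

```python
from math import log

# Invented mock "immune repertoire" of aligned fragments (real data would be
# hundreds of millions of naturally occurring antibody sequences).
REPERTOIRE = ["CARDYW", "CARDYW", "CARDFW", "CAKDYW", "CARDYW"]

def position_frequencies(seqs):
    """Estimate P(amino acid | position) with add-one smoothing."""
    alphabet = "ACDEFGHIKLMNPQRSTVWY"
    freqs = []
    for i in range(len(seqs[0])):
        counts = {aa: 1 for aa in alphabet}  # add-one smoothing
        for s in seqs:
            counts[s[i]] += 1
        total = sum(counts.values())
        freqs.append({aa: c / total for aa, c in counts.items()})
    return freqs

def naturalness(seq, freqs):
    """Log-likelihood of seq under the position-wise model: higher means
    the sequence looks more like the repertoire it was fit to."""
    return sum(log(freqs[i][aa]) for i, aa in enumerate(seq))

freqs = position_frequencies(REPERTOIRE)
# A repertoire-like sequence should score higher than an arbitrary one.
print(naturalness("CARDYW", freqs) > naturalness("WWWWWW", freqs))  # True
```

The same intuition carries over to the deep-learning version: sequences the model assigns high likelihood tend to share the properties of natural antibodies.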
OK. And I imagine, like, if you're if you're working along some, you know, like multi
parametric optimization, you want to constrain this space as much as you can with the early
steps. And so is what you're doing basically, like, you know, before you even get down to
like what could be a good binder or not, let's throw out all the things that we know are
going to be like highly toxic, or maybe they have like a premature stop codon or something,
you know, that would truncate it like, well, you know, where along the process are these
different steps working, I guess is what I'm curious about.
So for each one of these things we're building, there's actually multiple ways we can kind of
combine them together, depending on the problem that we're trying to solve.
It goes back to your points earlier about a KPI, for example, you need to be very thoughtful
about what you're applying this technology to.
One of the issues I think the field has broadly is you have a lot of smart AI
people who are building hammers and looking for nails.
And, you know, as we all know, you know, if you start with the nail, you know, as the
saying goes, it's going to be easier to find that solution.
So when we think about combining these together, it really depends on the problem that we're
trying to solve. So usually in the case of like a drug discovery campaign, you're going
to bring in naturalness as a way to select the molecules that are most interesting to you.
So we're at a point now where we can take our AI models, and then we can come up with
hundreds, even thousands of potential sequences that could be brought forward for your preclinical
testing. And the question is, how do you prioritize between these hundreds of different
molecules? And that's where we want to bring in this information about naturalness, for
example. That's one way that we use it. Another way that we can use it is actually just
bring all these properties together from the start.
So maybe you have some lead, and what you'd like to do is optimize that lead for various
properties. Maybe you don't have any sequence that has the affinity or naturalness profile
that you'd like, and you can just co-optimize for all these properties together. So that's
another one of the ways that you might use this information.
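One simple way to picture the "prioritize hundreds of candidates" step is a weighted combination of predicted properties. The candidate names, property values, and weights below are all hypothetical, and a single weighted sum is only a stand-in for whatever multi-objective selection a real pipeline uses.

```python
# Hypothetical model-generated candidates with predicted properties
# (values invented for illustration; not Absci's actual pipeline).
candidates = [
    {"id": "ab1", "affinity": 0.9, "naturalness": 0.2},
    {"id": "ab2", "affinity": 0.8, "naturalness": 0.9},
    {"id": "ab3", "affinity": 0.5, "naturalness": 0.95},
]

def priority(c, w_affinity=0.5, w_naturalness=0.5):
    """Scalarize multiple objectives into one ranking score. Real
    multi-objective selection might use Pareto fronts or hard constraints
    instead of a single weighted sum."""
    return w_affinity * c["affinity"] + w_naturalness * c["naturalness"]

ranked = sorted(candidates, key=priority, reverse=True)
print([c["id"] for c in ranked])  # ['ab2', 'ab3', 'ab1']
```

Note how the top pick is not the highest-affinity candidate: naturalness pulls a more developable-looking molecule ahead, which is exactly the trade-off being described.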
I imagine, you know, once you get past affinity and specificity, we're just trying to, like,
systematically remove as many as possible of the downstream, like, gotchas and surprises that keep
drugs from ultimately making it to, you know, commercialization. And so we've talked a lot about some of these,
I guess you could call them known unknowns, like we know what they are, but we just don't
know where they rank or how dangerous they are. And I would imagine like over the years,
people have always kind of had this like, you know, maybe it's hubris, but maybe it's also
just strategy or thinking about the problem. But you try to get rid of all these downstream
question marks. And like, is there a set of just, you know, unknown unknowns, things that we're not
even really able to query or look for that could still be a surprise? Like I think about this with,
and I'll use a specific example to maybe guide how I'm thinking about this, but like
post-translational modifications that are not like directly being measured by DNA sequencing,
or even in some cases like, you know, in like mass spec or protein sequencing.
If those show up in a CDR or one of those active regions, like, you could still,
I imagine, change the binding properties of an antibody. And I know there are some tools
to predict those liabilities and things like that, but I'm generally just curious about like the
unknown unknowns with antibody development and like places where you think no one's really
paying attention to like, you know, how to ensure that these things actually do make it
all the way to the end and to the finish line.
Yeah, absolutely. I mean, I think like one thing that comes to mind is a lot of
developability, you know, attributes. Let's say that, you know, you end up finding out that you
can't get good enough viscosity in order to do sub-q dosing.
And so you're having to do, like, you know, an infusion instead of being able to do sub-q
injections. And it all comes kind of back to data. Like, you know, that's actually one area
that I think like pharma actually has a lot of data on is the developability, you know, attributes.
You know, like, us, we don't have the technology to scale up, you know, being able to screen for
viscosity in kind of a high-throughput manner. And so that could potentially be,
you know, a blind spot for us: you know, we ultimately get down the road, and it's like,
we really want to do sub-q dosing, but we don't have the data to, you know, train our models to
predict for that. And I think that's where, you know, some really interesting ideas that we've
always had of like, how could, you know, some of the data that large pharma has that's not
necessarily for the drug itself or like the efficacy and the functionality, but it's kind of
on the developability. How could you actually kind of develop like a consortium where you could take
all of this data, it's for a greater good, and everyone could have access to the models that
are trained on this data to really help with kind of these developabilities where, you know,
it is an unknown unknown for us, you know, when we, you know, get to get to the end, but others
have, you know, the data. I think these are some interesting kind of ideas that we've had on how,
you know, the industry can, you know, can collaborate and form, you know, potential interesting,
you know, consortiums. Yeah. And I wanted to maybe just take a little bit of a side step and
touch on a topic from earlier, Joshua, when you were talking about large language models
and the analogies to, you know, to biology. I mean, the analogies are really beautiful. Like,
if you think about, you know, proteins having essentially like 20 canonical amino acids and,
you know, the English alphabet has 26 characters and they're structured into words and those words
form paragraphs. Like there are a lot of, I think, similarities between those things. So I wanted to
ask a question along the same sort of axis, which is, you know, we've been talking for the last 30,
40 minutes about antibodies, which are a subset of biologics, right, proteins. The models that
you're building are certainly, you know, not limited to one particular application or one
particular modality. And I would like to understand a little bit more about, like, what's common,
what's different when we're in our local neighborhood of proteins? And then as we really
zoom out, like, what modalities do you think are most amenable or maybe the most kind of,
you know, recalcitrant to in silico design? Yes. Are you thinking outside of biology or outside
of antibodies more generally? Yeah. Just starting with antibodies kind of zooming out a little bit
to biologics and then maybe taking a big step back and thinking about everything from small
molecules to whatever else. Sure. So first starting with antibodies, you know, antibodies
are really interesting because most of the binding is conferred by the CDRs, the
complementarity-determining regions in these antibodies. And that's something that turns
out that the model can learn extremely quickly, right, because it's something that is very clear
from the data. And that also means you have a very focused design task to work on designing those
CDRs. And that, you know, you really want to bake that into the task that you're working on. You
have these CDRs and then they're binding to a specific epitope. So we want to provide the model,
for example, with some structural information. You know, this is a departure from, let's say,
a traditional language model that's just in the sequence space. We're really thinking about how
do we bring in meaningful information about that target protein structure so that we can really
focus the model on that. I think this is something that's maybe more unique within the AI space,
right? This is not just a ChatGPT off the shelf, trained on biology sequences.
This requires really some deep thought about how do you best apply AI to these problems.
Now, when we zoom out a step, so we're focused on antibodies, they have a specific form.
When you move to a general protein, you don't really have that information anymore. So you need
the model to be able to represent that information. I'll give you one of the challenges that comes
up here. If you take, like, a general protein model and you apply it to antibodies, if you look at a lot
of the success of like protein language models within general protein design, essentially what
the models are learning to do, or at least what we think they're doing (it's hard to know
anything for sure in deep learning, so it's really a little bit of speculation), but
they're presumably taking a bunch of evolutionarily related sequences and using that to build some
understanding of the protein world. So this is the way people used to do, let's say protein folding
10, 20 years ago, you would take a given protein sequence, you would go enumerate the most similar
proteins to that, and then you would start to do some statistics on top. Then you're like, okay,
I see like this position is always the same. That probably means that position is like doing
something really important. Or I see these positions covary, it's always A-B or C-D, which
probably means that they're touching each other. Now, antibodies don't work that way. You know,
antibodies are not the result of evolution. It's not random, you know, pieces of dirt that we're
finding around the world that evolved in different ways. It's created by the immune system.
So you need to figure out general models that can really pick up on all that information.
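The classical MSA-statistics approach Joshua sketches (conserved columns, covarying column pairs) might look roughly like this. The toy alignment is invented, and real pipelines use mutual information or direct coupling analysis rather than this crude perfect-correlation check.

```python
# Hypothetical aligned homologs (a real multiple sequence alignment would
# hold many evolutionarily related sequences, not four toy strings).
msa = ["MKAL", "MKAL", "MRAV", "MRAV"]

def conserved_columns(alignment):
    """Columns where every sequence has the same residue, i.e. positions
    that are probably doing something important."""
    return [i for i in range(len(alignment[0]))
            if len({s[i] for s in alignment}) == 1]

def covarying_pairs(alignment):
    """Pairs of variable columns that are perfectly correlated (it's
    always A-B or C-D), a crude proxy for residues touching in 3D."""
    n = len(alignment[0])
    variable = [i for i in range(n) if len({s[i] for s in alignment}) > 1]
    pairs = []
    for a in variable:
        for b in variable:
            if a < b:
                combos = {(s[a], s[b]) for s in alignment}
                # Perfectly correlated: as many pair combos as residues
                # in either column alone.
                if len(combos) == len({s[a] for s in alignment}):
                    pairs.append((a, b))
    return pairs

print(conserved_columns(msa))  # [0, 2]
print(covarying_pairs(msa))    # [(1, 3)]
```

The point of the passage then follows: antibody CDRs are generated by the immune system rather than by gradual evolution, so these evolutionary statistics are far less informative for them.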
And I think that sort of segues into how do you apply this to an even broader picture
where the modalities really start to change in a broader way, right? If you wanted to have a model
that brings in proteins and small molecules, or maybe you wanted to bring in something like
protein dynamics, or if you wanted to represent a whole cell or a whole physiology,
again, you need to be thoughtful about how the domain switches over time,
and then how to sort of build the right biases into your models to pick up on all that information.
And you mentioned this point, you know, about off-the-shelf models and reapplying them.
And I wanted to just kind of give a reference frame, like if we're talking about experimental
ground truth, you know, for those proteins that we can crystallize and use like cryo-EM or x-ray
crystallography, we're getting on the order of what in terms of accuracy, like an angstrom, roughly?
You're talking about for protein folding, predicting the structure?
Yeah, like what I would love to do is just for people who are thinking about how good these
models are at predicting the general structure of a protein and how that relates to what
experimental ground truth is, and like what's been that vector of improvement. And then I want
to flip it back over and just ask this theoretical question about, you know, is there an asymptote?
Like, is there a reason why they can't just blow right through that experimental, you know, plateau?
Yeah. Okay. So I think first of all, what you're citing here, let's say in a protein folding
setting, right? You have all these structures. Now you can use AI to predict those structures.
For the average protein, you can get pretty close to experimental accuracy. Like we're talking about
like an angstrom, as you just pointed out, right? That's sort of what you can get in the lab anyways.
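For reference, "angstrom-level" accuracy in structure prediction is usually quantified as RMSD between predicted and experimental atomic coordinates. Here is a minimal sketch with invented coordinates, omitting the structural alignment step that real comparisons require.

```python
from math import sqrt

# Invented 3D coordinates (angstroms) for three atoms of a predicted
# structure and its experimental counterpart, assumed pre-aligned.
predicted    = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
experimental = [(0.1, 0.0, 0.0), (1.4, 0.0, 0.0), (3.2, 0.0, 0.0)]

def rmsd(a, b):
    """Root-mean-square deviation between paired 3D coordinates."""
    sq = [sum((p - q) ** 2 for p, q in zip(x, y)) for x, y in zip(a, b)]
    return sqrt(sum(sq) / len(sq))

print(round(rmsd(predicted, experimental), 3))  # 0.141
```

An RMSD around one angstrom is roughly the experimental resolution limit Joshua mentions, which is why predictions at that level count as near-experimental accuracy.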
So these things are very highly performant. Now there's exceptions, you know, something like
antibodies, it turns out are actually very hard to model. Those CDR loops that I mentioned before,
the ones that actually confer a lot of the binding, they're one of the hardest things to
actually model with protein folding tools. So one of the things we've done at Absci, for example,
is develop state-of-the-art capabilities for antibody antigen folding. So we can really get
a good idea of what our potential drugs are going to look like before we even go test them in the
lab. But where things start to get really interesting is accuracy becomes a lot harder to measure
when you're no longer in a prediction setting. You know, for protein folding, you have a sequence
and there's some structure, right? There is some answer. That sequence folds up into some structure.
I mean, maybe it's dynamic. We can, you know, have a whole conversation about whether there is
like one answer or not. But for the most part, it's a prediction problem. When you move into
a design setting, it's not like that anymore. I take some target antigen, I design antibodies
against it. Like what is the correct solution? I mean, there is no correct solution. There are
probably hundreds of correct solutions and it really depends on the constraints that you're
putting into the model. So that's where evaluation becomes critical. The NLP people realized this,
you know, a few years ago and started to put together all these benchmarks, which, you know,
things like GPT-4, you know, just blew everything out of the water.
And now it's like, did you just memorize all the benchmarks? How do we even evaluate models
anymore? And, you know, you're getting to that asymptote now. The big issue you're going
to have now in NLP will be evaluation. In biology, it's actually more convenient
because when we have high-throughput experimental capabilities, if I have a hundred
thousand proteins that I generate, I don't need, like, a hundred thousand human labelers to go look
at each of those. I just have a hundred thousand bacteria do it. And it's actually a lot more
scalable. So I think this is one of the funny things about biology. It feels very esoteric
and it feels very hard to build the data. But going to that point about like asymptotic limits,
for NLP, it's starting to become really hard. And this is something biology will be able
to get past, I think, because, again, you can really evaluate all of this within the lab in a very
scalable way. Interesting. So the problem is kind of flipped. It's like, you know, on one hand,
it's a dearth of data but easy to do these evaluations, and then the mirror inverse of that
for NLP. Yeah, I mean, that's pretty exciting. Like, I would love it if the, you know, the dominant
sort of headlines around AI just started to become mostly biological. And it seems like with
AlphaFold, I think that it really became, I mean, I remember going home for Thanksgiving was the
first time I'd ever been asked about protein folding over turkey. So that was, that was cool.
Yeah, no, thank you for that explanation. I think that makes a lot of sense. And, you know,
I think the importance is like the devil's in the details with these things, like, you know,
good enough can be fine for some downstream tasks. But to your point on antibodies, it's really got
to be virtually perfect to keep going, you know, down towards the next thing. So maybe in the last
couple of minutes, just kind of open ended questions around, you know, the pace of technological
improvement, what this field looks like by, you know, let's say a decade from now. And
if you think about the enabling technologies on both the dry lab and the wet lab side that could
really boost, you know, your capabilities of doing what you're doing, but also just for the
field of drug discovery, you know, it's an exciting thing for us because we're looking at,
you know, full stack: hardware, wetware, software. And I'm interested, you know, for you
guys, if you have strong opinions about that. Absolutely. I can start off with this. So where
this is all headed is personalized medicine. What we're going to start to see over the next
five to 10 years is that 4% success rate, you know, starting to increase, you know,
to 10%, to 20%, and so forth. And what that actually enables you to do is go after
smaller and smaller patient populations, because you're no longer, you know, having to pay for
as many drugs that failed in the clinic. I mean, that's why drug development is so expensive:
because you pay for the 96% that ultimately fail. But as you increase that, you can go to,
you know, smaller and smaller patient populations, ultimately getting to the point where
it's cheap enough to do personalized medicine and ultimately being able to take a patient sample,
find the target that's relevant for that disease and then design an antibody that not only hits
the epitope and has the affinity that you want, but the model actually starts to understand the
biology. It knows that target and it says, okay, this is the epitope and the affinity that
you need in order to achieve the biology that you're looking for. And I think that's the next
big step for us is being able to not only, you know, design the antibody to hit the target we
want with the affinity, but actually starting to understand the biology. And I think that's going
to be a big next step for us is being able to scale data around the biology. So when I get a
brand new target that comes in, I want the model to then again, give me the antibody that hits the
epitope and the affinity that achieves my biology. And so I see synthetic biology playing a very
important role in all of this. I mean, you see how important synthetic biology was for our success.
I mean, we started out as a synthetic biology company. And that technology was what allowed us
to scale biological data to train the AI models. Could the synthetic biology technology on its own
get the data? Yes, we could get by with it, but it wasn't going to solve the 4% success rate.
It wasn't going to decrease the time to clinic. You need to combine the two in order to
solve these biological problems. And that's the exciting thing is because the next problem we have
to solve is then in the wet lab. So how do you scale the biology data to train the models? And
that's another SynBio problem. And so I feel like, ultimately, SynBio is going
to play a very, very important role in developing technologies that can scale data to answer the
next question for AI. Got it. Got it. So it's basically two things, I think, that are shining
through on that for me. The first is people may not understand the disincentives around
creating drugs against rare disease. To your point, you underwrite this huge investment
with a low, very low probability of success. And to recoup your investment as a drug company on
the back end, you've got this finite window of time of patent protection to go out and make that
money back before there's competition. And not only is it difficult logistically to actually
source and find these patients when they're scattered about, but there just aren't many of them.
And if you have a treatment that in some cases,
like we're seeing with some of these gene editing techniques, potentially could be a cure,
right, then you're deleting that person's status as a patient, which is great,
and I think great for humanity. But to your point, the economic incentive has to be
there. And so by increasing the probability of success, you make tractable a greater subset
of diseases. So I think that's a really good point. And then on the synbioside, the genetic
engineering, you're right. I mean, from this whole conversation we've talked about,
there's a lack of the biological data that you need to train these models.
Using large animal systems that require a lot of time, immunization in the case of antibodies,
is just not a high-enough-throughput screening tool, right? We've got to figure out
in vitro mechanisms for generating data that's not only abundant, but high quality, testable,
functional. And to do that, we might have to do some genetic engineering of our own on these
single cell organisms. So I think those are both great points. Joshua, did you have anything else
that you wanted to add to the mix here? Yeah, I think maybe one thing on the technical side too
that I find really interesting here is about where does commoditization kind of appear across
the value chain here? So one of the nice things, I think, about kind of working at this intersection
of AI and data is that the inputs and outputs to our models, I think the costs seem to be coming
down really quickly. So on the inputs, our model designs all these sequences, we need to go synthesize
the DNA encoding those sequences. So you're starting to see the cost of DNA synthesis really come down
over time. And then on the back end, we take all this DNA, we run it through our E. coli system,
we have this flow cytometry-based assay. And at the end of the day, you've got
DNA and you need to go do sequencing. And sequencing, as we know, the costs are also
going down dramatically. So I think one really nice thing about working in this field is that if we
think about, like, the dollar cost of every, let's say, you know, data point we're creating, that
number is going down over time, which will mean that, you know, for the same amount of money, we
can just produce more data over time. So it's a really nice place to be.
And it's actually reminiscent, I think, of like the early days in computers. So,
you know, when Apple and Microsoft were building the first personal computers,
it was just really expensive to get hold of that hardware. And what did Microsoft do? They kind of
put out all the blueprints for how to make a computer. And people looked at that, and then all the OEMs
started to make their own computers; they commoditized it, and it got really cheap. And it's like, well,
you have all this hardware, but now you need the software and you go buy Microsoft OS.
So I think it's a really interesting thing. It's a way that really massive companies start to get
built when you have, you know, a real sort of advantage in the market because things around
you are getting cheaper. But the problem you're working on is very hard and you have a differentiated
angle on it. And I think, at the end of the day, too, like, the generative AI companies that are
going to win, and this is across, like, all industries, are those that own
and control the data. At the end of the day, those are going to be the companies that ultimately
win. And I think, like, you know, if you're looking at it kind of through an investor lens,
especially in this space, the differentiation is the data. And where are you getting your data?
How are you training this? Because that's ultimately like what's going to enable you to
have a competitive moat. Because at the end of the day, when we go fully in silico and we're able
to design a drug at the click of a button, it's going to be very hard for people to catch up to
us because we've, you know, spent so much time training the models on a ton of data to increase
the overall accuracy. And then we've done our own model designs as well. And so it comes down
to data. Data is key to success in generative AI. Yeah. And, you know, this is another one of
those cool KPIs we were talking about earlier: what is your dollar per data point,
you know, governed by the inputs and outputs? That's another one that I like. And maybe the last
comment or question I'll open it up to you guys is, Joshua, I really wanted to zoom in on the
statement you made about, you know, the cost of the data point coming down. It seems like if you
look at the last 50 years of, you know, drug discovery, drug design, as that kind of improvement
vector has, you know, continued churning along and the cost per data point comes down,
there's a point in which you almost like stop taking the traditional like hypothesis driven
approach to your problem. And you kind of just start letting the data points speak for themselves
and tell you what to do because it's cheap enough to do it that way. And of course there's
experimental design, I'm not saying throw that all out the window. But like, how do you think about
hypothesis-driven science in an era where there's abundance now in, you know, the data that you're
able to generate? Well, that's a great question. I kind of did this thought experiment myself
recently. So, you know, I don't do enough coding in my day job anymore.
So I said, let me just try to build something in a programming language I haven't worked with for
for many years, right? Start building an iPhone app. And I use GPT-4 as kind of my co-pilot,
right, to help me build it. And I realized something really interesting happened, something I'd never
really done before: I was literally just copying and pasting the code it was giving me
and putting it straight into Xcode to kind of build my app, right? Usually I'm going to read it very
carefully. I'm going to pull out the components I need. I'm going to rename the variables.
I was kind of just going on autopilot at some point, right? And I imagine a similar thing might
start to happen in drug discovery, right? Where the model just starts to get so good
that you're just like, okay, the model says go test this antibody. You just do it. You don't
even think twice, because it just becomes second nature to trust this thing. So I think that might
be where the field is heading. Of course, no one knows how things are going to play
out in drug discovery, but it's kind of cool to work in this space where you do see it playing
out in an earlier field. So a lot of those, like,
product experiences, you kind of get a sense into the future of what it's going to be like,
you know, in this industry. So yeah, that's one thing that might happen, right? It could
really change the game of how we do science here. So my question is what app are you building?
It was just, just playing around with some, some cool AI capabilities. I mean,
well, that was the cool thing about, you know, GPT-4: being able to build that app just in an evening,
right? Actually, Sean is making me push
on this, right? How was I able to do it quickly? I had some playbook, right? I, like, wrote
down on a piece of paper what I wanted the thing to look like and started building it that
way, right? So I think it could lead to a world where we're just a lot more efficient,
where scientists, instead of getting distracted by, like, fancy technologies, just trust
the model and think about, you know, what is the real application that you want to go after,
right? So you think about your disease indication, for example; you can be very passionate about
that, and it allows us to be more strategic about biotech and stop spending as much money kind of
chasing a lot of fancy technologies. I think drug discovery is really hard, so it's going to take
us some time to figure this out. And I think we're, you know, kind of being the trailblazers here on
that at Absci, like, thinking about what that future looks like. So for folks who are excited
about this, like, come join us and work with us on that journey. But yeah, I think it's a really
exciting time to be working in science more generally, because AI is, I think, going to really
revolutionize the way that we think and do our work. Well, I think that's a good place to end it.
Guys, it's been a blast. For people that have been listening in and stuck with us through the end
here, please go follow Absci on Twitter. We'll be sure to link, you know, all the recent literature
so people can, you know, stay engaged and learn about, you know, what you're doing and making
sure that they're informed. But other than that, Sean, Joshua, thank you for spending an hour talking
with us about this. It was a lot of fun and hope to see and talk to you guys again soon.
Yeah, thanks so much, Simon. Thanks, Simon. It was a lot of fun.
ARC believes that the information presented is accurate and was obtained from sources that ARC
believes to be reliable. However, ARC does not guarantee the accuracy or completeness of any
information. And such information may be subject to change without notice from ARC. Historical
results are not indications of future results. Certain of the statements contained in this podcast
may be statements of future expectations and other forward looking statements that are based on
ARC's current views and assumptions and involve known and unknown risks and uncertainties that could
cause actual results, performance or events to differ materially from those expressed or implied
in such statements.
Machine-generated transcript that may contain inaccuracies.