by Ari Wagen · Aug 22, 2025
At Rowan, we are in the final stages of deploying our own ColabFold server, both for Rowan users (via our web application and API) and for other computational scientists. If you would like access to our cloud MSA server, please fill out this form.
What follows is a detailed play-by-play and postmortem of a rise in MSA failures that caused protein co-folding jobs to fail, told both from our perspective and from what we can piece together from public sources, along with some reflections on what this means for the field.
Saturday, August 16, 2025 at 7:37pm. I was at a friend's house in Arlington, MA playing cards when I got a notification on my phone from a Rowan user alerting me to a pattern of Boltz-2 co-folding failures—the user wrote that Boltz-2 was failing "reproducibly," but the inputs he shared all seemed normal at a glance.
Sunday, August 17, 2025 at 12:11am. In an attempt to understand what problem our user was facing, I ran a series of co-folding jobs with different protein sequences, numbers of tokens, and settings. To my surprise, all of them failed relatively quickly. I escalated the problem to our engineering team, writing:
I think we might be getting rate-limited by KOBIC [the MSA server at api.colabfold.com], causing Boltz-2 jobs to fail: [link to jobs I ran]
Even the most simple jobs aren't completing
I also checked our job failure rate dashboard; over the last 24 hours, roughly 50% of the protein–ligand co-folding jobs submitted through Rowan had failed. This job failure rate was anomalous.
Our director of engineering wrote back 12 minutes later with a string of messages:
Exception: MMseqs2 API is giving errors. Please confirm your input is a valid protein sequence. If error persists, please try again an hour later.
error message doesn't suggest rate limit at least. It could be but unclear
Shud [sic] probably figure out a more robust way to do MSA regardless
I logged some to-dos from our conversation and went to sleep.
Sunday, August 17, 2025 at 5:31pm. I was at home cooking dinner when a second Rowan user submitted a ticket reporting Boltz-2 failures. Shortly after, Corin messaged our engineering channel:
Let's do a detailed investigation tomorrow morning and refund credits and email affected users
Monday, August 18, 2025 at 8:59am. In the discussion section of our meeting, we talked about the MSA-related errors and came up with a plan to start addressing the issue. In the near term, we would add a warning to Rowan's co-folding submit page, test and enable MSA-free Boltz-2 runs, and compile a list of affected users so we could refund credits and email them.
We also decided to prioritize exploring a more permanent set of MSA solutions, including standing up an internal version of the ColabFold server and caching MSA query outputs for reuse.
Monday, August 18, 2025 at 2:28pm. That morning and early afternoon, we added a warning to Rowan's co-folding submit page, tested and turned on MSA-free Boltz-2 runs, and compiled a list of affected users. Between 2:28pm and 3:31pm, I emailed each affected user, sending variations on the following message:
hi [user],
I'm emailing to let you know that we've refunded your Rowan account [123] credits on account of co-folding job failures.
Starting on August 16, 2025, the MSA server at https://api.colabfold.com/ (hosted by the Korean Bioinformation Center) has been intermittently failing with vague errors, causing ~50% of Boltz-2 jobs submitted through Rowan to fail. (This issue is affecting all Boltz-2 users, not just Rowan users.)
Between the beginning of the impacted period and now, you've had [12] jobs fail, which consumed [123] credits. We've added [123] credits back to your account.
While we work to improve the stability of MSA queries, the submit page will display a warning message about this issue. We've also added support for running Boltz-2 jobs without MSA: to run co-folding without MSA, deselect the "Use MSA Server?" toggle. We are actively working on longer-term improvements to co-folding stability.
If you have any questions, feel free to respond to this email.
best, Ari
From our internal aggregated job data, we've seen that protein–ligand co-folding job failures have not yet returned to pre–August 16 levels:
| Date | Co-Folding Failure Rate |
|---|---|
| Monday, August 11 | 1.2% |
| Tuesday, August 12 | 4.1% |
| Wednesday, August 13 | 1.3% |
| Thursday, August 14 | 4.7% |
| Friday, August 15 | 10.1% |
| Saturday, August 16 | 51.6% |
| Sunday, August 17 | 67.6% |
| Monday, August 18 | 48.8% |
| Tuesday, August 19 | 23.0% |
| Wednesday, August 20 | 16.8% |
Additionally, we've noticed that our users are submitting fewer co-folding jobs than before we added the warning to our submit page.
Before continuing, I'd like to look briefly at what MSA is, why it matters, and how it's run.
Multiple sequence alignment (MSA) is the process of aligning a set of protein sequences to identify regions of similarity. In this context, "similar" means that they share subsequences, or substrings, of amino acids. The motivation behind MSA is often described using evolutionary logic; the story goes something like this:
Long ago, there were fewer organisms and fewer proteins. Over time, millions of mutations happened, and deleterious mutations were lost to the wind. However, some mutations were able to improve or modify a protein's function, and these proteins have persisted. Proteins with a common ancestor protein are likely to have similar secondary and tertiary structures, so data from relatives can help improve protein folding and co-folding algorithms.
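As a toy illustration (hand-made sequences, purely for intuition, not real homologs), the sketch below lines up three short related sequences and stars the columns where every sequence agrees. Those conserved columns are exactly the "regions of similarity" MSA is after, and they hint at positions that matter for structure or function.

```python
# Toy example: three short, already-aligned sequences. A "-" marks a gap
# inserted so that related positions land in the same column.
alignment = [
    "MKTAYIAK-QRQISFVK",
    "MKTAYVAKLQRQISFVR",
    "MKSAYIAK-QRHISFVK",
]

for seq in alignment:
    print(seq)

# Star the columns where all three sequences show the same residue (and no
# gap) -- these are the conserved positions.
markers = "".join(
    "*" if len({seq[c] for seq in alignment}) == 1 and alignment[0][c] != "-" else " "
    for c in range(len(alignment[0]))
)
print(markers)
```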
The extra context that MSA adds to the problem of protein folding has made it a cornerstone of computational biology: AlphaFold, AlphaFold2, Boltz-1, Chai-1, and Boltz-2 have all relied on MSA to make high-quality structural predictions. MSA queries are run by default on the input protein sequences whenever a user submits a co-folding job, regardless of which code or platform they use.
State-of-the-art MSA generally relies on MMseqs2 (Steinegger and Söding [2017]) run through ColabFold (Mirdita et al. [2022]). With these modern tools, MSA and protein folding are theoretically within the reach of anyone with a PC—so why do places like Rowan, Neurosnap, Oxford, CINES, and the Boltz code itself all use the same public API?
While setting up any server can be a pain, ColabFold servers in particular require nearly 1 terabyte of data to run standard MSA queries. Still, a computer with a terabyte of storage can be purchased for $1,000–2,000, so storage alone shouldn't put this out of reach for university labs and companies.
The real issue is memory. Running MSA queries efficiently requires a lot of it: the alignment process needs many protein sequences loaded from storage into memory, and if the machine has too little memory, data is constantly shuffled back and forth between storage and RAM, so a single query can take hours on consumer hardware.
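For a rough sense of scale, here is a back-of-the-envelope calculation. The ~1 TB database size comes from above; the bandwidth figures are assumptions, not benchmarks, and a real search touches the data more than once and not purely sequentially, which is how times on consumer hardware stretch into hours.

```python
# Back-of-the-envelope (assumed bandwidths, not measurements): time for one
# full pass over ~1 TB of search databases at different effective read speeds.
DB_SIZE_GB = 1000
bandwidth_gb_per_s = {
    "databases resident in RAM": 20.0,
    "fast NVMe SSD": 3.0,
    "SATA SSD / contended disk": 0.5,
}
for medium, bw in bandwidth_gb_per_s.items():
    minutes = DB_SIZE_GB / bw / 60
    print(f"{medium:>28}: ~{minutes:.0f} min per full pass")
```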
Because of this, many teams have chosen to rely on the free ColabFold server at api.colabfold.com, which is hosted by the Steinegger lab at the Korean Bioinformation Center, or KOBIC (references: 1a0b670, d9adee4, 04a1791, 16a09e1).
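To make the dependency concrete, here is a minimal sketch of the kind of submit-poll-download loop that clients like ColabFold's run_mmseqs2 run against the public server. The endpoint paths, form fields, and status values are assumptions modeled on the ColabFold client, not an official API specification; the real client adds retries, rate limiting, pairing modes, and a descriptive user agent.

```python
# Sketch of a submit/poll/download loop against the public MSA server.
# Endpoints and response fields are assumptions based on the ColabFold client.
import time
import requests

HOST = "https://api.colabfold.com"
SEQUENCE = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # example query, not a real target

# 1. Submit the sequence and get back a job ticket.
resp = requests.post(f"{HOST}/ticket/msa", data={"q": f">query\n{SEQUENCE}", "mode": "env"})
ticket = resp.json()["id"]

# 2. Poll until the server reports the job is done (or errors out).
while True:
    status = requests.get(f"{HOST}/ticket/{ticket}").json()["status"]
    if status == "COMPLETE":
        break
    if status == "ERROR":
        raise RuntimeError("MSA server returned an error for this ticket")
    time.sleep(10)

# 3. Download the result archive (a tar.gz of .a3m alignments).
archive = requests.get(f"{HOST}/result/download/{ticket}").content
with open("msa_result.tar.gz", "wb") as f:
    f.write(archive)
```

Multiply that polling loop across every co-folding job launched worldwide and it's easy to see how a single large batch run can saturate the shared queue.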
OK, back to the story.
Wednesday, August 6. On the ColabFold GitHub, user TuganBasaran opened an issue saying:
Hi,
I'm trying to use Alpafold 2 Batch on colab and run a prediction of a fasta file and at the moment it has been pending for almost 5 hours for a single sequence. I have no idea why does it happen. I tried to restart session, changed the GPU 2-3 times and currently it is GPU is set to A100.
Is there a way to fix this problem?
sirius777coder, a second user, replied to report a similar issue.
Friday, August 15. On the ColabFold GitHub, sirius777coder opened an issue saying:
Hi, I submitted several jobs that remain pending after predicting 20 sequences, and the status has been stuck for a day. Could you check the current MSA server?
Monday, August 18 at 2:35am. On the Boltz community Slack, a member reported encountering errors with the MSA step in their Boltz-2 pipeline, saying:
Hi Everyone, I managed to run few predictions, but as of today inference fails with: boltz/data/msa/mmseqs2.py", line 215, in run_mmseqs2. But they were working three days back. Exception: MMseqs2 API is giving errors. Please confirm your input is a valid protein sequence. If error persists, please try again an hour later.
At 4:44am, a second member responded: "I am also having the same issue since Friday (15th August)…"
Monday, August 18. GitHub user Jacoberts reported receiving a 403 Forbidden Error from the ColabFold API.
Tuesday, August 19 at 5:26am. Responding to the previous Boltz-2 Slack thread, a third member asked "is it working for you? it's still not working fo rme [sic]."
Wednesday, August 20 at 5:51am. Another Boltz-2 Slack member reported having issues with ColabFold as well, writing:
Hi everyone, I am unable create a GitHub issue in the below repository.
https://github.com/sokrypton/ColabFold/issues
Can someone please help me how to get collaborator on this repo? So that I can create an issue. Thank you in advance!!
From the responses to a number of GitHub issues, we can start to understand why we and so many other Boltz-2 users have been experiencing difficulties using the api.colabfold.com API this past week.
Last week, user milot-mirdita (a postdoc in Steinegger's lab) responded to ColabFold issue #759, saying:
I just killed the whole queue, someone filled the queue with 2500 jobs :/
Sorry about that. Please resubmit your jobs.
We will need to implement some fairer prioritization mechanism and rethink how this job queue works
This issue was opened and closed before this week, and the impact seems to have been small enough to fly under our radar at Rowan, but in retrospect we can see that global demand for co-folding and MSA queries was already straining the api.colabfold.com computers.
Over the weekend, in response to ColabFold issue #763, Prof. Steinegger wrote, "https://api.colabfold.com/queue It looks like the queue is crazy full. Somebody is probably running a big job :/," and later "We restarted the server and blocked some user groups. I hope it is working better now." sirius777coder reported that "my task is running normally now!", and the issue was closed.
This second issue adds another data point to our understanding. A surge in MSA queries was putting Steinegger's lab into a "whac-a-mole" problem-solving mode: manually clearing the queue and blacklisting users from accessing the API.
The most complete information we have is from issue #764, which was opened on Monday. In response to this issue, Milot wrote:
I guess you became collateral damage of someone starting a huge job over the weekend and totally flooding the server. They did that in such a way that it was difficult for the normal rate limiting mechanisms to take deal with it, so I had to take a sledgehammer approach.
Please send me (either here or per email) your ip address/range and I'll enable access again.
And:
I changed the blocks to target cloud IPs at finer granularity. I hope you are not blocked anymore.
After three more users replied with similar issues, Milot wrote a comment on Wednesday, August 20, saying:
Please generate MSA's locally and run boltz [sic] on the locally generated MSAs. We currently cannot support the load from batch cloud runs.
The progress in biomolecular structure prediction over the past two decades is one of the big success stories of deep learning and has transformed the way many teams approach science.
As demand for running methods like Boltz-2, AlphaFold2, and Chai-1 continues to grow, our team finds it unlikely that academic labs will continue to be able and willing to host reliable, public web servers for commercial entities to run expensive MSA queries at library scale.
To shift the MSA load generated by Rowan users off the Steinegger lab's web server, and to give any blacklisted groups an alternative to self-hosting a machine with onerous memory requirements, we will be standing up a Rowan-hosted instance of the ColabFold server. We are currently in the final stages of testing and optimizing it. If you would like access to this server, please complete this form.
Once our server is tested, stable, and deployed for production use, we will move Rowan users' MSA queries over to it. We plan to implement a priority queue that serves queries from Rowan's web application and free-to-use Python API before queries from others with access to the server, and to explore caching MSA results where applicable to save compute and shorten job times.
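As one example of what sequence-level caching can look like (an illustrative sketch, not our production implementation), the cache below is keyed on a hash of the normalized query sequence, so repeat submissions of the same protein reuse the stored alignment instead of hitting the MSA server again.

```python
# Sketch of sequence-keyed MSA caching (illustrative only). Identical proteins
# submitted by different jobs hit the cache instead of triggering a new query.
import hashlib
from pathlib import Path

CACHE_DIR = Path("msa_cache")
CACHE_DIR.mkdir(exist_ok=True)

def cache_key(sequence: str) -> str:
    """Hash of the whitespace-stripped, upper-cased sequence."""
    normalized = "".join(sequence.split()).upper()
    return hashlib.sha256(normalized.encode()).hexdigest()

def get_or_compute_msa(sequence: str, run_msa_query) -> str:
    """Return a cached .a3m alignment, computing and storing it on a miss.

    `run_msa_query` is whatever function actually talks to the MSA server.
    """
    path = CACHE_DIR / f"{cache_key(sequence)}.a3m"
    if path.exists():
        return path.read_text()
    a3m = run_msa_query(sequence)  # expensive: hits the MSA server
    path.write_text(a3m)
    return a3m
```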
I'm very sorry for the sharp, unforeseen increase in job failure rates that Rowan users experienced over the weekend. If you have any feedback about our incident handling, response, or communication, please reach out to contact@rowansci.com.