Frequently Asked Questions

Running Casanovo

I installed Casanovo and it worked before, but after reopening Anaconda it says that Casanovo is not installed.

Make sure you are in the casanovo_env environment. You can ensure this by typing:

conda activate casanovo_env

Which command-line options are available?

Run the following command in your command prompt to see all possible command-line configuration options:

casanovo --help

Additionally, you can use a configuration file to fully customize Casanovo. You can find the config.yaml configuration file that is used by default here.

How do I solve a “PermissionError: GitHub API rate limit exceeded” error when trying to run Casanovo?

When running Casanovo in denovo or eval mode, Casanovo needs compatible pretrained model weights to make predictions. If no model weights file is specified using the --model command-line parameter, Casanovo will automatically try to download the latest compatible model file from GitHub and save it to its cache for subsequent use. However, the GitHub API is limited to a maximum of 60 requests per hour per IP address. Consequently, if Casanovo has already been executed multiple times, it might temporarily be unable to communicate with GitHub. You can avoid this error by explicitly specifying the model file using the --model parameter.

GPU Troubleshooting

Casanovo is very slow even when running on the GPU. How can I speed it up?

It is highly recommended to run Casanovo on the GPU to get the maximum performance. If Casanovo is slow despite your system having a GPU, then the GPU might not be configured correctly. A quick test to verify that Casanovo is using your (CUDA-enabled) GPU is to run watch nvidia-smi in your terminal. If Casanovo has access to the GPU, then you should see it listed in the bottom process table, and the “Volatile GPU-Util” column at the top right should show activity while Casanovo is processing the data.

If Casanovo is not listed in the nvidia-smi output, then it is not using your GPU. This is commonly caused by an incompatibility between your NVIDIA drivers and PyTorch. Although PyTorch is installed automatically when installing Casanovo, in this case we recommend reinstalling it manually according to the following steps:

1. Uninstall the current version of PyTorch: pip uninstall torch
2. Install the latest version of the NVIDIA drivers using the official CUDA Toolkit. If supported by your system, an easy alternative is to install it with conda: conda install -c nvidia cuda-toolkit.
3. Install the latest version of PyTorch according to the instructions on the PyTorch website.

Try to run Casanovo again and use watch nvidia-smi to check whether it can use the GPU now. If this is still not the case, please open an issue on GitHub. Include full information about your system setup, the installed CUDA toolkit and PyTorch versions, and the troubleshooting steps you have performed.
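As a complement to nvidia-smi, you can also ask PyTorch directly whether it sees a CUDA device. The following is a minimal sketch, assuming it is run inside the same environment as Casanovo (e.g. casanovo_env). The helper name describe_gpu_support is hypothetical and not part of Casanovo.

```python
def describe_gpu_support() -> str:
    """Summarize whether PyTorch can see a CUDA-enabled GPU."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment."

    if not torch.cuda.is_available():
        # torch.version.cuda is None for CPU-only PyTorch builds.
        return (
            f"PyTorch {torch.__version__} cannot see a CUDA device "
            f"(built against CUDA: {torch.version.cuda})."
        )

    names = [
        torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())
    ]
    return (
        f"PyTorch {torch.__version__} (CUDA {torch.version.cuda}) sees "
        f"{len(names)} GPU(s): {', '.join(names)}"
    )


if __name__ == "__main__":
    print(describe_gpu_support())
```

If the output reports that no CUDA device is visible, a mismatch between the installed PyTorch build and the NVIDIA drivers is a likely cause, and it may be useful to include this output when opening an issue.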

Why do I get a “CUDA out of memory” error when trying to run Casanovo?

This means that there was not enough (free) memory available on your GPU to run Casanovo, which is especially likely to happen when you are using a smaller, consumer-grade GPU. Depending on whether the error occurred during train or denovo mode, we recommend decreasing the train_batch_size or predict_batch_size options, respectively, in the config file to reduce the number of spectra that are processed simultaneously. Additionally, we recommend shutting down any other processes that may be running on the GPU, so that Casanovo can exclusively use the GPU.

How can I run Casanovo on a specific GPU device?

You can control which GPU(s) Casanovo uses by setting the devices option in the configuration file. This setting also controls the number of cores to use when running on a CPU only (which can be specified using the accelerator option).

By default, Casanovo will automatically try to use the maximum number of devices available: if your system has multiple GPUs, then Casanovo will use all of them for maximum efficiency. Alternatively, you can select a specific GPU by specifying the GPU number as the value for devices. For example, on a four-GPU system, specifying devices: 1 in your config file makes Casanovo use only the GPU with identifier 1.

The config file functionality only allows specifying a single GPU, by setting its id under devices, or all GPUs, by setting devices: -1. If you want more fine-grained control to use some but not all GPUs on a multi-GPU system, then the CUDA_VISIBLE_DEVICES environment variable can be used instead. For example, by setting CUDA_VISIBLE_DEVICES=1,3, only GPUs 1 and 3 will be visible to Casanovo, and specifying devices: -1 will allow it to utilize both of these.

Note that when using CUDA_VISIBLE_DEVICES, the visible GPUs are renumbered consecutively starting from 0, and any GPU numbers specified under devices must use this new numbering.
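To illustrate the renumbering, the following sketch maps the device numbers that Casanovo would see to the physical GPU identifiers. It assumes CUDA_VISIBLE_DEVICES holds a comma-separated list of integer GPU ids. The helper visible_device_mapping is hypothetical and only mimics the renumbering; it does not query the GPUs.

```python
def visible_device_mapping(cuda_visible_devices: str) -> dict[int, int]:
    """Map renumbered device indices (keys) to physical GPU ids (values).

    Assumes a comma-separated list of integer GPU ids, e.g. "1,3".
    """
    physical_ids = [
        int(token) for token in cuda_visible_devices.split(",") if token.strip()
    ]
    # Visible devices are numbered consecutively from 0 in the listed order.
    return dict(enumerate(physical_ids))


mapping = visible_device_mapping("1,3")
print(mapping)  # {0: 1, 1: 3}
```

In this example, physical GPU 3 is addressed as device 1 once CUDA_VISIBLE_DEVICES=1,3 is set.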

I see “NotImplementedError: The operator ‘aten::index.Tensor’…” when using a Mac with an Apple Silicon chip.

Casanovo can leverage Apple’s Metal Performance Shaders (MPS) on newer Mac computers, which requires that the PYTORCH_ENABLE_MPS_FALLBACK environment variable is set to 1:

export PYTORCH_ENABLE_MPS_FALLBACK=1

This will need to be set with each new shell session, or you can add it to your .bashrc / .zshrc to set this environment variable by default.

Training Casanovo

Where can I find the data that Casanovo was trained on?

Different versions of Casanovo were trained on distinct datasets, as detailed below. For explanations about the different versions, please refer to the citation information page.

Casanovo v2.0

Casanovo v2.0 was trained on a commonly used nine-species benchmark. The data for this benchmark are available as annotated MGF files on MassIVE with dataset identifier MSV000090982. Annotated MGF files that are directly compatible with Casanovo are available in the /MSV000090982/updates/2024-05-14_woutb_71950b89/peak/9speciesbenchmark FTP directory. Casanovo was trained in a cross-validated fashion, training on eight species and testing on the remaining species.

A further reanalysis of these data is described here.

Casanovo v4.0

Casanovo v4.0 was trained on two datasets:

- The MassIVE-KB training data, derived from PSMs used to compile the MassIVE-KB v1 spectral library, consisting of 30 million PSMs. These PSMs were obtained by collecting up to the top 100 PSMs for each of the precursors (as defined by a peptidoform and charge) included in MassIVE-KB. The data are available as annotated MGF files on Zenodo:
  - MassIVE-KB v1 30 million PSMs
  - MassIVE-KB v1 30 million PSMs training/validation/test splits

  The dataset was originally compiled through the following steps:
  - On the MassIVE website, go to MassIVE Knowledge Base > Human HCD Spectral Library > All Candidate library spectra > Download.
  - This will give you a zipped TSV file with the metadata and peptide identifications for all 30 million PSMs.
  - Using the filename (column “filename”) you can then retrieve the corresponding peak files from the MassIVE FTP server (done using a wget script) and extract the desired spectra using their scan number (column “scan”).
- The non-enzymatic dataset, used to train a non-tryptic version of Casanovo, was created by selecting PSMs with a uniform distribution of amino acids at the C-terminal peptide positions from two datasets: MassIVE-KB and PROSPECT. Training, validation, and test splits for the non-enzymatic dataset are available as annotated MGF files on MassIVE with dataset identifier MSV000094014.

Casanovo v4.2

Casanovo v4.2 was trained on a combined tryptic and non-tryptic dataset, as described in this manuscript. This dataset consists of 2 million PSMs sampled from MassIVE-KB, with tryptic peptides sampled from the MassIVE-KB v1 data described above and multi-enzyme data derived from the corresponding subsection of MassIVE-KB v2.0.15. The data are available on Zenodo as annotated MGF files for the training and test splits in the mskb_final subdirectory.

Casanovo v5.0

Casanovo v5.0 was trained on the same dataset as v4.2.

How do I know which model to use after training Casanovo?

When running model validation, Casanovo will use the validation data to compute performance measures (training loss, validation loss, amino acid precision, and peptide precision) and print this information to the console and log file. At the end of each validation run and at the end of each training epoch (one complete run over the training data), Casanovo will take a snapshot of the current model weights. After the training job is finished, the validation snapshot that achieved the lowest validation loss will be saved to the output directory as <output_root>.best.ckpt. Additionally, a snapshot of the model weights at the end of each training epoch will be saved to the output directory as epoch=<epoch>-step=<step>.ckpt. Snapshots from previous training epochs will be overwritten with the latest training snapshot at the end of each training epoch.

By default, Casanovo runs model validation every 50,000 training steps. Note that the number of samples that are processed during a single training step depends on the batch size. Therefore, with the default training batch size of 32, a validation snapshot is saved after every 1.6 million training samples. You can optionally modify the validation run frequency in the config file (parameter val_check_interval), depending on your dataset size. Note that running model validation very frequently will slow down training, because Casanovo evaluates its performance on the full validation data at every validation check.
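The relationship between val_check_interval, the batch size, and the dataset size can be worked out with some simple arithmetic. The following sketch assumes training on a single device with a fixed batch size. The helper functions and the dataset size of 200,000 spectra are hypothetical and only serve as an illustration.

```python
import math


def samples_between_validations(
    val_check_interval: int, train_batch_size: int
) -> int:
    """Number of training samples processed between two validation runs."""
    return val_check_interval * train_batch_size


def interval_for_validations_per_epoch(
    n_training_spectra: int, train_batch_size: int, validations_per_epoch: int
) -> int:
    """Training steps between validation runs for a desired frequency."""
    steps_per_epoch = math.ceil(n_training_spectra / train_batch_size)
    return max(1, steps_per_epoch // validations_per_epoch)


# Defaults described above: validation every 50,000 steps, batch size 32.
print(samples_between_validations(50_000, 32))  # 1600000

# Hypothetical dataset of 200,000 spectra, validating about 4 times per epoch.
print(interval_for_validations_per_epoch(200_000, 32, 4))  # 1562
```

For a small dataset like the hypothetical one above, the default interval of 50,000 steps would span several epochs, so a smaller val_check_interval may be more informative.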

Even though I added new post-translational modifications to the configuration file, Casanovo didn’t identify those peptides.

Casanovo can only make predictions using post-translational modifications (PTMs) that were included when training the model. If you want to add new types of PTMs, then you will need to retrain the model.

The config.yaml configuration file contains all amino acids and PTMs that Casanovo knows. By default, this includes oxidation of methionine, deamidation of asparagine and glutamine, N-terminal acetylation, N-terminal carbamylation, and an N-terminal loss of ammonia. (Additionally, cysteines are always considered to be carbamidomethylated.) However, simply changing the residues alphabet in the configuration file is insufficient to identify new types of PTMs with Casanovo. The configuration file marks options that can be modified during inference with (I). The residues option does not carry this mark, and changing any unmarked option requires training a new Casanovo model.

Therefore, to learn the spectral signature of previously unknown PTMs, a new Casanovo version needs to be trained. To include new PTMs in Casanovo, you need to:

1. Update the residues alphabet in the configuration file accordingly.
2. Compile a large training dataset that includes those PTMs and format this as an annotated MGF file. Note that you can include some or all of the data that was originally used to train Casanovo (see above), in addition to the data that includes your new types of PTMs.
3. Train a new version of Casanovo on this dataset.

Alternatively, you can fine-tune an existing pre-trained Casanovo model on data with new PTMs instead of training from scratch. See the next question for a step-by-step guide.

How do I fine-tune Casanovo on data with new PTMs?

Instead of training a new model from scratch, Casanovo can fine-tune a pre-trained checkpoint with an extended residue vocabulary. When loading the checkpoint, Casanovo copies the learned weights for all tokens that the checkpoint already knows, and initializes new tokens from existing ones using the new_token_init configuration option.

To fine-tune Casanovo with new PTMs:

1. Add the new PTM to the residues dictionary in the configuration file. For example, to add acetylation of lysine:
```
residues:
  # ... existing residues ...
  "K[Acetyl]": 170.105528
```
2. Map each new token to an initialization source in the new_token_init configuration option. This tells Casanovo which existing token’s learned weights to copy as a starting point for the new token:
```
new_token_init:
  "K[Acetyl]": "K"
```
3. Prepare an annotated MGF training dataset that includes spectra with the new PTMs. You can combine this with some or all of the data that was originally used to train Casanovo (see above).

4. Run fine-tuning, specifying the pre-trained checkpoint with --model:

casanovo train --model pretrained.ckpt -p validation_spectra.mgf training_spectra.mgf

Tip

The initialization source in new_token_init should be the unmodified amino acid (e.g. "K" for "K[Acetyl]") or an amino acid with a similar PTM that the model already knows. This gives the model a reasonable starting point for the new token’s weights.

Warning

Every new token in residues that is not present in the checkpoint must have a corresponding entry in new_token_init. Omitting an entry will cause Casanovo to raise a ValueError at startup.
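The rule in the warning above can be checked before starting a training run. The following is a minimal sketch that operates on plain Python dictionaries. It assumes that the set of tokens known to the checkpoint is available as checkpoint_tokens (how to obtain it is not shown here), and that an initialization source must itself be a token known to the checkpoint. The function check_new_token_init is hypothetical and not part of Casanovo.

```python
def check_new_token_init(
    residues: dict[str, float],
    checkpoint_tokens: set[str],
    new_token_init: dict[str, str],
) -> list[str]:
    """Return a list of problems with the new_token_init mapping."""
    problems = []
    for token in residues:
        if token in checkpoint_tokens:
            continue  # Learned weights are copied from the checkpoint.
        source = new_token_init.get(token)
        if source is None:
            problems.append(f"{token!r} has no entry in new_token_init")
        elif source not in checkpoint_tokens:
            # Assumption: the source must be known to the checkpoint.
            problems.append(
                f"{token!r} is initialized from unknown token {source!r}"
            )
    return problems


residues = {"K": 128.094963, "K[Acetyl]": 170.105528}
checkpoint_tokens = {"K"}  # Hypothetical checkpoint vocabulary.

print(check_new_token_init(residues, checkpoint_tokens, {"K[Acetyl]": "K"}))
print(check_new_token_init(residues, checkpoint_tokens, {}))
```

The first call reports no problems, whereas the second call reports the missing entry for "K[Acetyl]".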

Monitoring for catastrophic forgetting. When fine-tuning, the model may lose performance on its original training distribution. To monitor this, use the --tracking_peak_path (-t) option to track validation loss on additional files without influencing checkpoint selection:

casanovo train --model pretrained.ckpt -p new_ptm_validation.mgf -t original_validation.mgf training_spectra.mgf

Files specified with -p (--validation_peak_path) contribute to the aggregate valid_CELoss metric used for selecting the best checkpoint. Files specified with -t (--tracking_peak_path) are logged per-file only (as valid/<file_stem> in the console and CSV logs, e.g. valid/original_validation) and do not influence which checkpoint is saved as best. This makes it easy to detect if the model is forgetting what it previously learned.

Tip

To view per-file tracking losses, enable tb_summarywriter: true or log_metrics: true in the configuration file.

How can I change the learning rate schedule used during training?

By default, Casanovo uses a learning rate schedule that combines a linear warm-up with a subsequent cosine decay (as implemented in CosineWarmupScheduler in casanovo/denovo/model.py) during training. To use a different learning rate schedule, replace the lr_scheduler variable in the Spec2Pep.configure_optimizers function in casanovo/denovo/model.py with an alternative learning rate scheduler, for example:

lr_scheduler = torch.optim.lr_scheduler.LinearLR(optimizer, total_iters=self.warmup_iters)

You can use any of the scheduler classes available in torch.optim.lr_scheduler or implement your own custom learning rate schedule, similar to CosineWarmupScheduler.
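As an illustration of a custom schedule, the sketch below defines a multiplicative learning rate factor with a linear warm-up followed by an exponential decay. This is not the schedule that Casanovo uses. The function name, the decay_rate parameter, and the example values are assumptions, and the sketch assumes the scheduler is stepped once per training step. Such a function could be wrapped in torch.optim.lr_scheduler.LambdaLR, as indicated in the comment.

```python
from functools import partial


def warmup_exponential_factor(
    step: int, warmup_iters: int, decay_rate: float
) -> float:
    """Learning rate factor: linear warm-up, then exponential decay."""
    if step < warmup_iters:
        # Increase linearly from 1 / warmup_iters up to 1.
        return (step + 1) / warmup_iters
    # Decay by a constant factor for each step after the warm-up.
    return decay_rate ** (step - warmup_iters)


lr_lambda = partial(warmup_exponential_factor, warmup_iters=4, decay_rate=0.5)
print([lr_lambda(step) for step in range(7)])
# [0.25, 0.5, 0.75, 1.0, 1.0, 0.5, 0.25]

# Possible use inside configure_optimizers (not executed here):
# lr_scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```

Keeping the factor as a plain function makes it easy to inspect or plot the schedule before using it for training.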

Miscellaneous

Can I use Casanovo to sequence antibodies?

Yes, antibody sequencing is one of the popular uses for de novo sequencing technology. This article carried out a systematic comparison of six de novo sequencing tools (Novor, pNovo 3, DeepNovo, SMSNet, PointNovo and Casanovo). Casanovo fared very well in this comparison: “Casanovo exhibits the highest number of correct peptide predictions compared with all other de novo algorithms across all enzymes demonstrating the advantage of using transformers for peptide sequencing. Furthermore, Casanovo predicts amino acids with overall superior precision.”

In practice, you may want to try providing your Casanovo output file to the Stitch software, which performs template-based assembly of de novo peptide reads to reconstruct antibody sequences (Schulte and Snyder 2024).

Where can I find Casanovo model weights trained on the nine-species benchmark?

You can find the Casanovo weights corresponding to the nine-species benchmark on Zenodo, compatible with Casanovo v4.x.x. These weights correspond to training and validation on eight species using the default configurations, with the remaining species held out for testing, as indicated by the file names. Note that these weights are only intended for evaluation purposes on this specific benchmark dataset. For general-purpose usage of Casanovo, use its default weights instead, as these will give significantly improved performance.

How can I generate a precision–coverage curve?

You can evaluate a trained Casanovo model against ground-truth peptide labels using a precision–coverage curve.

1. Run Casanovo in sequencing or evaluation mode on your MS/MS data, as described here.
2. Collect the ground-truth peptide labels as well as the peptide labels predicted by Casanovo. Note that Casanovo might not report a peptide for every spectrum if some spectra are invalid (e.g. not enough peaks), so make sure that both pieces of information are correctly linked to each other (using the spectra_ref column in the mzTab output file produced by Casanovo).
3. Use the following script to plot a precision–coverage curve:

import depthcharge
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import auc

from casanovo.denovo import evaluate


# `psm_sequences` is assumed to be a DataFrame with at least the following
# three columns:
#   - "sequence": The ground-truth peptide labels.
#   - "sequence_pred": The predicted peptide labels.
#   - "search_engine_score[1]": The prediction scores.
psm_sequences = ...  # TODO: Get the PSM information.

# Sort the PSMs by decreasing prediction score.
psm_sequences = psm_sequences.sort_values(
    "search_engine_score[1]", ascending=False
)
# Find matches between the true and predicted peptide sequences.
aa_matches_batch = evaluate.aa_match_batch(
    psm_sequences["sequence"],
    psm_sequences["sequence_pred"],
    depthcharge.masses.PeptideMass("massivekb").masses,
)
# Calculate the peptide precision and coverage.
peptide_matches = np.asarray([aa_match[1] for aa_match in aa_matches_batch[0]])
precision = np.cumsum(peptide_matches) / np.arange(1, len(peptide_matches) + 1)
coverage = np.arange(1, len(peptide_matches) + 1) / len(peptide_matches)
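# Find the position of the first prediction with a negative score, i.e. the
# first prediction that doesn't fit the precursor m/z tolerance anymore. If no
# score is negative, fall back to the last position (np.argmax would return 0).
below_zero = (psm_sequences["search_engine_score[1]"] < 0).to_numpy()
threshold = np.argmax(below_zero) if below_zero.any() else len(below_zero) - 1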

# Print the performance values.
print(f"Peptide precision = {precision[threshold]:.3f}")
print(f"Coverage = {coverage[threshold]:.3f}")
print(f"Peptide precision @ coverage=1 = {precision[-1]:.3f}")

# Plot the precision–coverage curve.
width = 4
height = width / 1.618
fig, ax = plt.subplots(figsize=(width, width))

ax.plot(
    coverage, precision, label=f"Casanovo AUC = {auc(coverage, precision):.3f}"
)
ax.scatter(
    coverage[threshold],
    precision[threshold],
    s=50,
    marker="D",
    edgecolors="black",
    zorder=10,
)
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)

ax.set_xlabel("Coverage")
ax.set_ylabel("Peptide precision")
ax.legend(loc="lower left")

plt.savefig("prec_cov.png", dpi=300, bbox_inches="tight")
plt.close()
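To make the cumulative precision and coverage calculation in the script above more concrete, the following sketch repeats it on a toy example using only the Python standard library. The function precision_coverage and the four example predictions are hypothetical. The input is assumed to be already sorted by decreasing prediction score.

```python
from itertools import accumulate


def precision_coverage(
    peptide_matches: list[bool],
) -> tuple[list[float], list[float]]:
    """Compute precision and coverage at each rank of sorted predictions."""
    n = len(peptide_matches)
    # Number of correct predictions among the top-1, top-2, ... predictions.
    cumulative_correct = list(accumulate(int(m) for m in peptide_matches))
    precision = [c / (i + 1) for i, c in enumerate(cumulative_correct)]
    coverage = [(i + 1) / n for i in range(n)]
    return precision, coverage


# Four predictions sorted by decreasing score; the third one is incorrect.
precision, coverage = precision_coverage([True, True, False, True])
print([round(p, 3) for p in precision])  # [1.0, 1.0, 0.667, 0.75]
print(coverage)  # [0.25, 0.5, 0.75, 1.0]
```

The last precision value corresponds to the peptide precision at a coverage of 1, i.e. when all predictions are considered.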