Project 2

Towards GC-MS: Adapting SIRIUS and CSI:FingerID for Electron Ionization fragmentation

Supervisors: Sebastian Böcker, Georg Pohnert, FSU Jena; Aleš Svatoš, MPI for Chemical Ecology

Background: Mass spectrometry (MS) is the analytical platform of choice for high-throughput screening of small molecules. MS is typically used in combination with a chromatographic separation technology; gas chromatography (GC-MS) is arguably still the best separation tool for compounds amenable to the technique. Electron (impact) Ionization (EI) simultaneously ionizes and fragments the molecules; the resulting spectra are fragment-rich but often show a low-intensity or missing molecular ion peak, meaning that the mass of the compound is often unknown. Recently, technically mature GC-MS instruments with high mass accuracy have become available, making de novo interpretation of EI fragmentation data possible. The Böcker group develops the computational tools SIRIUS and CSI:FingerID: SIRIUS annotates fragmentation spectra using fragmentation trees, and does not require any databases for doing so. CSI:FingerID (Dührkop et al., PNAS 2015) searches for the unknown compound in a structure database such as HMDB or PubChem, and has repeatedly won contests for this task (http://casmi-contest.org/). Both tools are widely used by the scientific community: in 2017, data from more than 700,000 compound queries were uploaded to the CSI:FingerID web service for identification. The Böcker group also researches related subjects in computational mass spectrometry, such as False Discovery Rate estimation for spectral library search (Scheubert et al., Nature Communications 2017).

Project Description: With the advent of high mass accuracy GC-MS instrumentation, it becomes possible to adapt our computational tools for GC-MS data. GC-MS and EI fragmentation differ in many details from LC-MS and tandem MS, and several subproblems must be addressed; for example:

1.    EI mass spectra are often missing the molecular ion peak, and the mass and/or molecular formula of the compound has to be reconstructed from the fragments using Machine Learning and combinatorics.
2.    EI mass spectra contain isotope patterns, which can be used to improve fragmentation tree quality. Unfortunately, radical losses H and H3 often interfere with the interpretation of the isotope patterns.
3.    Available reference data for high mass accuracy GC-MS is insufficient to train Machine Learning methods. To bypass this, we want to "lift" low mass accuracy spectra and add them to the training data.
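
To illustrate why hydrogen losses interfere with isotope patterns (item 2), the following minimal Python sketch computes how close the 13C isotope peak of an [M-H]+ ion lies to the monoisotopic molecular ion peak M+. It is only an illustration under simplifying assumptions: it considers a single H loss, ignores the electron mass, uses rounded standard monoisotopic masses, and the function names are hypothetical and not part of SIRIUS or CSI:FingerID.

```python
# Standard monoisotopic masses in Da (rounded to 6 decimals).
MASS_H = 1.007825
MASS_C12 = 12.000000
MASS_C13 = 13.003355


def interference_gap(molecular_ion_mass):
    """Return (mass of the 13C peak of [M-H]+, its distance to M+ in Da).

    Assumption: electron mass and charge effects are ignored.
    """
    m_minus_h_13c = molecular_ion_mass - MASS_H + (MASS_C13 - MASS_C12)
    gap = molecular_ion_mass - m_minus_h_13c
    return m_minus_h_13c, gap


def required_resolving_power(mass, gap):
    """Rough resolving power m / delta_m needed to separate two peaks."""
    return mass / gap


if __name__ == "__main__":
    # Example: toluene, C7H8 (monoisotopic mass computed from the constants).
    toluene = 7 * MASS_C12 + 8 * MASS_H
    peak, gap = interference_gap(toluene)
    print(f"M+ at {toluene:.4f} Da, 13C peak of [M-H]+ at {peak:.4f} Da")
    print(f"gap: {gap * 1000:.2f} mDa")
    print(f"resolving power needed: {required_resolving_power(toluene, gap):.0f}")
```

The gap is independent of the compound: it is simply the difference between the mass of a hydrogen atom and the 13C/12C mass difference, a few millidalton. Unless the instrument resolves this difference, the two peaks overlap and the apparent isotope pattern is distorted.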

We will also promptly apply the developed methods to biological data. The project will be conducted in close collaboration with experimentalist groups, in particular those of Georg Pohnert and Aleš Svatoš.

Candidate Profile:

  • M.Sc. in bioinformatics, cheminformatics, computer science, mathematics
  • Expertise and interest in algorithmics and bioinformatics methods development
  • Experience in biochemistry is highly desirable
  • Expertise in Machine Learning is desirable
  • Experience in software development (Git, Artifactory)
  • Java, Python
  • Ability to interact with coworkers, collaboration partners and software users
