Bachelor and Master Thesis

If you are interested in one of the topics below, email us at mail@bsc.fu-berlin.de. If you have a proposal for your own research that fits our scope, just let us know.

Available Theses (Master and Bachelor)

Depending on the available time, most topics below can be scaled to either B.Sc. or M.Sc. level.

Write a generic visualization app for mzQC

Adoption and public exposure of quality control in mass-spectrometry (MS) has gained increasing traction in recent years. The Proteomics Standard Initiative (PSI) has developed an open exchange format
named [mzQC](https://hupo-psi.github.io/mzQC/), which aims to foster capturing, exchanging and archiving quality control related data across all MS-based OMICS, such as proteomics, metabolomics and lipidomics.
Currently, there exists no software which is capable of visualizing and summarizing the content of any given mzQC file or a set of files.

Tasks:
1. Pick a visualization framework of your choice (e.g. Streamlit or R Shiny) and write code (probably in Python or R) to allow a user to explore the content of a given (uploaded) mzQC file.
2. Visualization could be a textual summary as well as (interactive) plots for the QC data contained within the mzQC file. Plot types should be chosen automatically, depending on each metric's properties.
3. The app should be capable of loading multiple mzQC files (also via a drop-in folder, e.g. when new results arrive from a measurement) and displaying a timeline of analysis quality, where interesting metrics can be configured by the user.
4. When user-provided quality thresholds are not reached, the app should be able to send automated emails, warning the user of quality issues.
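
As a starting point for tasks 1 and 4, the following minimal sketch flattens the metrics of one mzQC document and reports scalar metrics that fall below user-provided minimums. The field names (`runQualities`, `setQualities`, `metadata`, `inputFiles`, `qualityMetrics`, `accession`, `name`, `value`) reflect our reading of the mzQC JSON layout and should be checked against the specification. The accessions `EX:0000001` and `EX:0000002` and the helper names are made up for illustration.

```python
import json

# Hypothetical mzQC content; the accessions and metric names are placeholders, not real CV terms.
EXAMPLE = json.loads("""
{
  "mzQC": {
    "version": "1.0.0",
    "runQualities": [
      {
        "metadata": {"inputFiles": [{"name": "run_01.mzML"}]},
        "qualityMetrics": [
          {"accession": "EX:0000001", "name": "example MS2 spectrum count", "value": 800},
          {"accession": "EX:0000002", "name": "example charge distribution", "value": [1, 2, 3]}
        ]
      }
    ]
  }
}
""")


def iter_metrics(mzqc_doc):
    """Yield (run_label, accession, name, value) for every metric in an mzQC document."""
    body = mzqc_doc.get("mzQC", {})
    for group in ("runQualities", "setQualities"):
        for quality in body.get(group, []):
            meta = quality.get("metadata", {})
            # Fall back to the input file names if no explicit label is given.
            label = meta.get("label") or ", ".join(
                f.get("name", "?") for f in meta.get("inputFiles", [])
            )
            for metric in quality.get("qualityMetrics", []):
                yield label, metric.get("accession"), metric.get("name"), metric.get("value")


def below_threshold(mzqc_doc, minimums):
    """Return scalar metrics whose value is below the user-provided minimum (keyed by accession)."""
    alerts = []
    for label, accession, name, value in iter_metrics(mzqc_doc):
        limit = minimums.get(accession)
        if limit is not None and isinstance(value, (int, float)) and value < limit:
            alerts.append((label, name, value, limit))
    return alerts


for alert in below_threshold(EXAMPLE, {"EX:0000001": 1000}):
    print("QC warning:", alert)
```

The flattened tuples could feed a table or plot in Streamlit or Shiny, and the alert list could trigger the email notification. Non-scalar values (lists, tables) are skipped by the threshold check and would need their own plot types.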

Highly multiplexed TMT-Support in OpenMS

OpenMS is a well-known, mature C++ framework for the analysis of mass spectrometry data, supporting various vendors. One common application in proteomics is the multiplexing of samples (concurrent measurement) using Tandem-Mass-Tags (TMT).
Recently developed TMT kits support up to 35 channels, which in itself is simply an extension of previous versions with up to 18 channels. However, due to tightly packed (in terms of mass differences) channel signals,
interaction of neighbouring channels requires a more sophisticated approach when correcting for channel cross-talk.
The project aims at deriving a cross-talk channel, based on simple mass differences, which can be used to correct cross-talk via an existing non-linear least squares algorithm.
To verify correctness, unit tests and application-level tests will be implemented.
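
To illustrate the general idea of cross-talk correction, here is a toy sketch in Python. It assumes each channel leaks a fixed fraction of its signal into its two direct neighbours (the fractions 0.03 and 0.05 are invented), builds the resulting mixing matrix, and recovers the true intensities. A plain Gaussian elimination with clamping of negative values stands in for the least-squares solver; it is not the OpenMS algorithm, and deriving the real leak fractions is part of the thesis.

```python
def crosstalk_matrix(n_channels, leak_prev, leak_next):
    """Column j describes how the true signal of channel j spreads over the observed channels."""
    m = [[0.0] * n_channels for _ in range(n_channels)]
    for j in range(n_channels):
        m[j][j] = 1.0 - leak_prev - leak_next
        if j > 0:
            m[j - 1][j] = leak_prev
        if j < n_channels - 1:
            m[j + 1][j] = leak_next
    return m


def apply_matrix(matrix, vector):
    """Forward model: observed = matrix * true."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    a = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= factor * a[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(a[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (a[i][n] - s) / a[i][i]
    return x


def correct(matrix, observed):
    """Undo cross-talk; negative intensities are not physical and are clamped to zero."""
    return [max(0.0, value) for value in solve(matrix, observed)]


mixing = crosstalk_matrix(4, leak_prev=0.03, leak_next=0.05)
true_signal = [100.0, 0.0, 50.0, 200.0]
observed = apply_matrix(mixing, true_signal)
print(correct(mixing, observed))
```

With noisy data, clamping after an exact solve is a crude substitute for a constrained least-squares fit, which is why a dedicated solver is preferable in practice.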

Quality control metrics in pMultiQC

pMultiQC is a new Python-based package for quality control of mass-spectrometry-based proteomics experiments. Recently, support for the generation of mzQC output files has been added to pMultiQC,
to enable export of quality control information from well-known analysis software, such as MaxQuant, OpenMS or quantMS.
The aim of the thesis is to distill a set of important quality control metrics which are predictive of an experiment's quality, and to implement these in various submodules of pMultiQC. A benchmark should
provide insight into which metrics correlate (and are thus redundant) and which are most informative.
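
One simple way to look for redundant metrics is pairwise Pearson correlation across runs. The sketch below is only an illustration: the metric names, the values and the 0.95 cutoff are invented, and a real benchmark would likely use more robust statistics.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant metric carries no correlation information
    return cov / (sd_x * sd_y)


def redundant_pairs(metrics, threshold=0.95):
    """Return (metric_a, metric_b, r) for all pairs with |r| >= threshold."""
    pairs = []
    for a, b in combinations(sorted(metrics), 2):
        r = pearson(metrics[a], metrics[b])
        if abs(r) >= threshold:
            pairs.append((a, b, r))
    return pairs


# Hypothetical per-run values of three metrics (one entry per LC-MS run).
metrics = {
    "psm_count": [100, 200, 300, 400],
    "peptide_count": [80, 160, 250, 330],
    "mass_error_ppm": [1.2, 0.8, 1.5, 0.9],
}
for a, b, r in redundant_pairs(metrics):
    print(f"{a} and {b} are highly correlated (r = {r:.3f})")
```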

An algorithm for Spectrum Identification Quality

The RMIC score serves as a basic and rapid quality measure that is independent of the applied peptide identification strategy, as it relies only on the given spectrum and peptide information. According to https://doi.org/10.1002/pmic.201400560: To evaluate the quality of an identified spectrum, the relative matched ion count (RMIC) score is calculated by the intensities of the matched fragment ions (a/b/c, x/y/z, y-NH₃, y-H₂O, b-NH₃, b-H₂O, precursor MH, MH-NH₃, and MH-H₂O) of the peptide divided by the total ion current of the related spectrum; an RMIC of 0.5 thus means that 50% of the spectral peak intensity can be explained by fragment ion peaks (within a window of 0.5 Da for each peak).

The goal is to implement a tool in OpenMS that computes RMIC, and potentially other scores, and evaluates their usefulness.
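
The definition above can be sketched in a few lines of Python. This is only an illustration and not the OpenMS implementation: the theoretical fragment m/z values are passed in as a plain list (generating them for all ion types is left out), and the 0.5 Da window is interpreted as ±0.5 Da around each theoretical fragment, which is an assumption.

```python
def rmic(peaks, fragment_mzs, tolerance=0.5):
    """Relative matched ion count: matched peak intensity divided by total ion current.

    peaks        -- list of (m/z, intensity) tuples of the observed spectrum
    fragment_mzs -- theoretical fragment m/z values of the identified peptide
    tolerance    -- matching window in Da (assumed to be +/- tolerance)
    """
    total = sum(intensity for _, intensity in peaks)
    if total == 0:
        return 0.0
    # Each observed peak is counted at most once, even if several fragments match it.
    matched = sum(
        intensity
        for mz, intensity in peaks
        if any(abs(mz - fragment) <= tolerance for fragment in fragment_mzs)
    )
    return matched / total


# Hypothetical spectrum and fragment list.
spectrum = [(100.0, 10.0), (200.2, 30.0), (300.0, 40.0), (450.9, 20.0)]
fragments = [100.1, 200.0, 451.0]
print(rmic(spectrum, fragments))  # 60 of 100 intensity units are explained
```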

Simulate IonMobility, FineIsotope and other modern features

OpenMS (www.openms.de) has a rather well-rounded simulator for proteomics mass spectrometry data. However, modern instruments are capable of producing more complex data. For example, they add another separation dimension, such as ion mobility, in addition to the well-known retention time. Modern instruments are also capable of resolving fine isotopic structures, e.g. mass defects.

The aim of this thesis is to implement these modern features into the simulator using C++. Some models (such as fine-isotope structures) are available in general, but not coupled to the simulator. Others, such as ion mobility, require a simulation from the ground up.

Comparison and benchmarking of DeNovo Search Engines

LC-MS-based proteomics typically relies on database search engines which require a FASTA proteome as input. In contrast, de novo methods attempt to determine the peptide sequence underlying an MS/MS spectrum by purely algorithmic methods.

Recently, there have been many exciting publications detailing new de novo tools: Casanovo, Spectralis, π-HelixNovo, InstaNovo, GraphNovo.

The aim of this thesis is to compare the strengths and weaknesses of these tools and benchmark them against established de novo methods and traditional DB search using appropriate benchmark datasets and evaluation metrics (incl. runtime, RAM etc).
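
One possible evaluation metric is peptide-level recall against a set of trusted reference identifications. The sketch below is a simplified illustration, not a prescribed benchmark protocol: it treats isoleucine and leucine as identical because they have the same mass, ignores modifications, and uses invented spectrum IDs and sequences.

```python
def normalize(sequence):
    """Upper-case the sequence and merge I/L, which are indistinguishable by mass."""
    return sequence.upper().replace("I", "L")


def peptide_recall(predicted, reference):
    """Fraction of reference spectra whose predicted sequence equals the reference peptide.

    predicted -- dict: spectrum ID -> de novo sequence (spectra may be missing)
    reference -- dict: spectrum ID -> trusted peptide sequence
    """
    if not reference:
        return 0.0
    hits = sum(
        1
        for spectrum_id, truth in reference.items()
        if spectrum_id in predicted
        and normalize(predicted[spectrum_id]) == normalize(truth)
    )
    return hits / len(reference)


# Hypothetical reference identifications and tool output.
reference = {"s1": "PEPTIDE", "s2": "LEAK", "s3": "SAMPLER"}
predicted = {"s1": "PEPTLDE", "s2": "IEAK", "s3": "SAMPELR"}
print(f"peptide recall: {peptide_recall(predicted, reference):.2f}")
```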

Currently Running Topics

Already Finished Topics (incomplete)

Unsupervised detection of peptide modifications (Kilian Malek)

In order to identify a peptide sequence from mass spectrometry data, the user must provide both an appropriate database (e.g. the human proteome) and a list of expected (post-translational) modifications, usually oxidation of M and acetylation of the protein N-term. While usually sufficient, this may lead to suboptimal results if other modifications (e.g. deamidation) were introduced during sample handling.

Instead of employing an open modification search for every dataset (very CPU-intensive and not part of a standard analysis), it would be good to have a method which automatically detects whether this is necessary by inspecting the search results for clues. Thus, we are basically looking for a quality metric which might be based on differences between identified and unidentified MS spectra and their underlying data (e.g. using mass decimals, etc).

The new metric should be able to detect missing modifications using a benchmark dataset (e.g. from https://www.mcponline.org/content/15/8/2791 - dataset PXD002389).

Implement small-databases in Proteomics suite (Philipp Wang)

Small protein databases, or the fact that a research question only focusses on a rather small subset of proteins when the sample actually contains many proteins, demand special methods when computing false-discovery rates and statistical confidence - see https://www.youtube.com/watch?v=jIFyqXaN7RI&t=10s. The aim of this thesis is to implement group-FDR and neighbour-subset search strategies into OpenMS (www.openms.org) and validate and compare their performance against traditional FDR on adequate samples.
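
To illustrate why grouping matters, the following sketch estimates the FDR with a basic target-decoy ratio, once over all PSMs and once per group. It is a toy example and not the OpenMS implementation: the scores, group names and threshold are invented, and real tools typically compute q-values rather than a single ratio.

```python
def fdr_at_threshold(psms, threshold):
    """Estimate FDR as #decoys / #targets among PSMs scoring at or above the threshold.

    psms -- list of (score, is_decoy, group) tuples
    """
    targets = sum(1 for score, is_decoy, _ in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy, _ in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


def group_fdr(psms, threshold):
    """Estimate the FDR separately for each PSM group."""
    groups = {}
    for psm in psms:
        groups.setdefault(psm[2], []).append(psm)
    return {name: fdr_at_threshold(members, threshold) for name, members in groups.items()}


# Hypothetical PSMs: a small protein subset of interest and a large background.
psms = [
    (0.90, False, "subset"), (0.80, False, "subset"), (0.70, True, "subset"),
    (0.95, False, "background"), (0.85, False, "background"),
    (0.75, False, "background"), (0.72, False, "background"),
    (0.60, True, "background"),
]
print("global FDR:", fdr_at_threshold(psms, 0.65))
print("group FDR: ", group_fdr(psms, 0.65))
```

In this toy data the global estimate (1/6) hides the much higher error rate within the small subset (1/2), which is the effect that group-specific FDR strategies aim to expose.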

Proteomics Database Suitability (Tom Waschischeck)

Using the correct database (usually in FASTA format) for the annotation of MS/MS spectra in a proteomics experiment is vital to the successful annotation of peptide-spectrum-matches (PSMs) and downstream false-discovery-rate filtering. For well-studied organisms, this can be easily achieved by simply downloading the database from public sources, such as UniProt. For other cases, mostly non-model organisms and meta-proteomics, the picture is less clear, since it is hard to assess how well the proteome of a related species or a set of assumed organisms really fits the data under study. Therefore, a database suitability score has recently been developed, which aims to alleviate some of the above issues: it reports a score from 0-100%. The interpretation of the score from the original paper is not entirely intuitive, since it suffers from a non-linearity. Ideally, 100% would indicate that the database is a perfect match, i.e. covers all peptides in the experiment, whereas 0% would indicate completely unrelated sequences. This thesis aims to accomplish this goal by implementing a corrected suitability score, using some theory on hyperbolic functions and additional search runs for extrapolating some function parameters. The results will be validated using multiple search engines (which extends the original Comet-only approach) and datasets. Also, the ability to detect unusual contaminants, which are usually not accounted for (e.g. Mycoplasma), will be examined.

Tree-based Alignment of LC-MS data (Julia Thüringer)

Problem:
Liquid chromatography (LC) - Mass Spectrometry (MS) data is extremely complex and whole-cell lysates are never fully sequenced using today's state-of-the-art technology.
Comparison of LC-MS data can be significantly enhanced by a so-called 'map alignment', where unidentified features are assigned to a peptide sequence using information acquired in another LC-MS run. Since LC is potentially unstable, a retention time (RT) correction procedure is usually required before IDs can be successfully transferred based on an accurate mass and time approach. To reduce the number of user-defined parameters and achieve maximal robustness at the same time, the alignment should not require a single reference run.

Solution:
A guide-tree based multiple alignment should be implemented within the OpenMS C++ software framework (www.openms.de).
The algorithm must feature a robust metric to estimate an initial distance matrix (e.g. percentage of overlapping IDs, stddev of matching IDs, or a combination of them).
To prove the quality of the implemented solution, it should be compared against a current implementation that requires a reference run, using benchmark metrics such as 1) stddev of aligned pairs, possibly with cross-validation (a subset of IDs for alignment, tested on the held-out set), 2) number of transferable IDs, and 3) number/size distribution of consensus features (larger but fewer clusters should be preferred).
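
As an illustration of the initial distance matrix, the sketch below uses one minus the Jaccard overlap of identified peptide sequences as the distance between two runs and picks the closest pair, which would be the first merge in a guide tree. The run names and peptide IDs are invented, and the choice of Jaccard overlap is only one of the options mentioned above. The thesis itself targets C++, so this Python version is meant purely as pseudocode that runs.

```python
def id_distance(ids_a, ids_b):
    """Distance between two runs: 1 - (shared IDs / all IDs seen in either run)."""
    a, b = set(ids_a), set(ids_b)
    union = a | b
    if not union:
        return 1.0  # no identifications at all: treat the runs as maximally distant
    return 1.0 - len(a & b) / len(union)


def distance_matrix(runs):
    """Return sorted run names and the symmetric matrix of pairwise distances."""
    names = sorted(runs)
    matrix = [[id_distance(runs[r], runs[c]) for c in names] for r in names]
    return names, matrix


def closest_pair(names, matrix):
    """Return the two distinct runs with the smallest distance (first guide-tree merge)."""
    best = None
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if best is None or matrix[i][j] < best[0]:
                best = (matrix[i][j], names[i], names[j])
    return best[1], best[2]


# Hypothetical peptide identifications per LC-MS run.
runs = {
    "A": {"P1", "P2", "P3", "P4"},
    "B": {"P1", "P2", "P3", "P5"},
    "C": {"P1", "P6", "P7"},
}
names, matrix = distance_matrix(runs)
print(closest_pair(names, matrix))
```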

Implementations of high quality will be integrated into the official release of the OpenMS software.

Requirements:
Expertise in C++ or a closely related language (Java) and object-oriented programming is strongly advised.
Basic knowledge of LC-MS is desirable, but can be acquired at a sufficient level during the first days.

A Fast Aho-Corasick implementation using Double-Arrays (Patricia Scheil)

Aho-Corasick (AC) is a prominent method to search multiple patterns concurrently in a large text. The algorithm is usually implemented using a trie data structure, which wastes a lot of memory (at the potential speed benefit of eliminating the failure function when using a complete trie by direct jumps). There are many ways to compact a trie, but one of the most promising methods is using a double array (DA). This thesis will implement (one of the flavours of) a DA, which should keep the data structure small enough for common problem sizes in proteomics (60k peptides) to still fit into the L3 cache of most modern CPUs (8 MB), thus avoiding costly cache misses. The new DA-CDA will be benchmarked against existing approaches, such as AC-Trie and Suffix-Arrays.
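
For orientation, here is a compact sketch of the classic trie-based Aho-Corasick automaton with failure links, i.e. the memory-hungry baseline and not the double-array variant developed in the thesis. Each state stores its transitions in a Python dict, and the peptide-like patterns are invented.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto, failure and output tables for a set of non-empty patterns."""
    goto = [{}]    # goto[state][char] -> next state
    fail = [0]     # failure link of each state
    output = [[]]  # patterns that end at each state
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(pattern)
    # Breadth-first traversal: failure links of shallower states are final first.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit patterns that end at the longest proper suffix state.
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def search(text, automaton):
    """Return (start_index, pattern) for every occurrence of any pattern in text."""
    goto, fail, output = automaton
    state = 0
    hits = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in output[state]:
            hits.append((i - len(pattern) + 1, pattern))
    return hits


automaton = build_automaton(["PEP", "EPT", "TIDE", "IDE"])
print(search("PEPTIDE", automaton))
```

A double array would replace the per-state dicts with two flat integer arrays, which is where the memory savings and cache friendliness come from.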

Implementation and validation of mzQC metrics in PTXQC (Chen Xu)

Experimental data in proteomics and metabolomics is usually acquired using high-throughput instrumentation such as high-performance liquid chromatography - mass spectrometry (HPLC-MS).
Data is then analyzed by automated pipelines, e.g. OpenMS or MaxQuant, without user interaction. Since the amount of data is simply too large for manual validation,
automated systems, such as PTXQC, have been developed to extract known quality metrics from all stages of the pipeline to generate a human-readable quality control report.
For longitudinal studies, summary statistics and machine learning approaches, it is desirable to have the QC data available in a standardized machine-readable format.
For this purpose, mzQC was developed, which stores QC metrics in a JSON format using a controlled vocabulary. This enables querying QC data across any number of mzQC files in an automated fashion.
The goal of this thesis is to implement mzQC export functionality within PTXQC, by assigning existing CV terms to PTXQC metrics and writing a JSON file.
Then standardized JSON data can be generated for a variety of LC-MS studies and compared, to gain summary statistics of certain platforms, setups and the biological systems under study.

A Webserver for Quality Control Reports (Kristin Koehler)

Quality Control (QC) is an essential step in high-throughput technologies, such as mass-spectrometry-based proteomics. We have developed a framework in R called PTXQC (available on CRAN), which can create QC reports. To ease usage and avoid the overhead of maintaining a local R installation for practitioners/users, this thesis aims at developing a Shiny application (web service based on R components) where users can upload their data and receive a QC report in return. A successful implementation will be installed permanently on FU webservers to serve the research community.

Requirements: Expertise in the R programming language is strongly advised. Interest in web technology and Docker containers is a must. Basic knowledge of LC-MS is desirable, but can be acquired at a sufficient level during the first days.

Bioinformatics Solution Center