The Pillar Post

Low-frequency somatic mutation variant calling is a challenge

March 27, 2018 by Dale Yuzuki

The complexity of the NGS process, compounded by sample processing, results in unique challenges for the clinical oncology laboratory

Analysis of clinical samples for matching patients to targeted therapy is an inexorable trend. In the summer of 2017, Thermo Fisher received the first FDA companion diagnostic test approval covering multiple non-small cell lung cancer (NSCLC) therapies, while Foundation Medicine received FDA approval for their FoundationFocus CDxBRCA as a companion diagnostic for an ovarian cancer treatment, using their Comprehensive Genomic Profiling assay.

Circulating cell-free DNA assays for multiple purposes can be considered the ‘next big thing’. Roche’s cobas EGFR Mutation Test v2 was approved by the FDA in the summer of 2016 as a PCR-based assay for specific EGFR mutations in metastatic NSCLC, serving as a companion diagnostic for Tarceva (erlotinib) therapy. Other commercial laboratory-developed tests (LDTs) from organizations like Foundation Medicine and Guardant Health are actively marketed and sold to oncologists, and their submissions for regulatory approval are underway. There are several other examples, but the overall trend is clear: next-generation sequencing has arrived in the molecular pathology laboratory, widening from examination of solid tumor samples into analysis of hematological cancers and cell-free DNA.

Pre-analytical variability in sample types

One key consideration is how samples are treated before any nucleic acid purification occurs. The standard handling after a surgical biopsy is to fix the tissue in formalin, embed it in paraffin wax, section it on a microtome, and have a pathologist examine the stained sections. If deemed necessary, several of these slides are then provided to the molecular laboratory for analysis.

One problem occurs at this phase: there is variability in the quality of the reagents (especially the formalin used for preserving the tissue) and in the time spent in the preservative (after submersion into formalin for the fixation step, incubation can range from hours to days, depending on the thickness and volume of the tissue sample). Artifacts in the sequencing data arise from variation among the reagents involved, or from deamination, which converts cytosine residues to uracil (read by the sequencer as thymine) and 5-methylcytosine residues to thymine.
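
Because deamination produces a characteristic C>T / G>A signature, pipelines often flag low-frequency calls matching that pattern for manual review. Here is a minimal sketch of that idea; the function interface and the 10% VAF cutoff are illustrative assumptions, not a validated filter.

```python
# Minimal sketch: flag low-frequency C>T / G>A calls as candidate FFPE
# deamination artifacts. The 10% VAF cutoff is an illustrative
# assumption, not a production setting.

DEAMINATION_CHANGES = {("C", "T"), ("G", "A")}  # G>A is C>T on the opposite strand

def flag_deamination_artifact(ref: str, alt: str, vaf: float,
                              vaf_cutoff: float = 0.10) -> bool:
    """Return True if this call matches the deamination signature at a
    frequency low enough to plausibly be a fixation artifact."""
    return (ref.upper(), alt.upper()) in DEAMINATION_CHANGES and vaf < vaf_cutoff

# Example: a 3% C>T call is flagged for review; a 40% C>T call is not.
print(flag_deamination_artifact("C", "T", 0.03))  # True
print(flag_deamination_artifact("C", "T", 0.40))  # False
```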

For FFPE samples, other effects at work have to do with the pH of the formalin (composed of 4% formaldehyde in phosphate buffer), as lower pH will cause acid-induced hydrolysis and fragmentation of DNA. This pH change is caused by oxidation of the formaldehyde by atmospheric oxygen, producing formic acid. The bond between purine bases and the sugar-phosphate backbone hydrolyzes, generating abasic sites. During library preparation, when a DNA polymerase reads through an abasic site, adenines (and secondarily guanines) are preferentially incorporated. For additional background, see this reference.

For cell-free DNA analysis, a number of research articles have compared methods of preservation (both standard and proprietary commercial methods). A recent publication comparing three commonly used blood collection tubes, EDTA, Streck and CellSave, indicated a 6-hour window to isolate plasma for all three methods. Yet it is clear that collection into these tubes needs further examination and clinical verification before being placed into routine use.

Sources of inherent sequencing error in the NGS raw data

At the level of the NGS raw data, the manufacturers of sequencing instrumentation have gone to great lengths to obtain the highest possible data quality. Yet inherent difficulties remain due to the multiple steps between purified nucleic acid and a FASTQ file.

That DNA sample undergoes transformation into library molecules. If a PCR-based method is used, each round of PCR may introduce an error into the final set of molecules; hybridization-based enrichment goes through PCR as well, though typically only a few rounds of amplification. Even with proofreading enzymes, whose error rate may be on the order of 1 in 50,000 per base, a measurable level of error is introduced via amplification.
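
A quick back-of-envelope calculation shows how these per-cycle errors compound. The sketch below uses the 1-in-50,000 figure quoted above; the cycle count and amplicon length are illustrative assumptions.

```python
# Back-of-envelope sketch: how PCR errors accumulate over cycles.
# Assumes a proofreading polymerase error rate of ~1 in 50,000 per base
# per replication; 25 cycles and a 150 bp amplicon are illustrative.

def fraction_with_error(error_rate: float, cycles: int, amplicon_len: int) -> float:
    """Approximate fraction of final library molecules carrying at least
    one polymerase-introduced error, assuming each base in a molecule's
    lineage is copied once per cycle and errors are independent."""
    per_base_error = 1.0 - (1.0 - error_rate) ** cycles
    return 1.0 - (1.0 - per_base_error) ** amplicon_len

# 25 cycles over a 150 bp amplicon at 1/50,000 per base per cycle:
print(f"{fraction_with_error(1 / 50_000, 25, 150):.1%}")  # roughly 7% of molecules
```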

That collection of library molecules then undergoes cluster amplification. This amplification step occurs within the sequencing instrument and, while its error rate is very low, may also introduce some error. Then the bases need to be called (A / G / T / C, or N for ‘no-call’) based on an evaluation of signal-to-noise. In the context of massively parallel, next-generation sequencing, a single cluster may partially overlap its neighbor, degrading the quality of that call and giving a lower quality score for that base.
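
Those per-base quality scores are expressed on the Phred scale, where Q = -10 × log10(P_error), so a Q30 base has a 1-in-1,000 chance of being wrong. A short sketch of the conversion:

```python
import math

# Sketch of the Phred quality scale used for per-base calls:
# Q = -10 * log10(P_error), so Q30 means a 1-in-1,000 chance
# that the base is wrong.

def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score to a probability of error."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p: float) -> float:
    """Convert a probability of error to a Phred quality score."""
    return -10 * math.log10(p)

print(phred_to_error_prob(30))    # 0.001
print(error_prob_to_phred(0.01))  # 20.0
```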

Remember, though, that any errant base misincorporated during the prior library preparation or cluster amplification steps will be reported as a high-quality base by the sequencer: an incorrect base that was sequenced properly by the system.

Sources of additional error in alignment and making a call

Now, with an abundance of high-quality sequence reads in a FASTQ file, comes the difficult work of translating them into a Variant Call Format file (known as a VCF). This work is difficult not because of the discrete steps involved, but rather due to the complexities of the genome.

Take PIK3CA (phosphatidylinositol 3-kinase p110α catalytic subunit) as an example of an important oncogene, one of the most frequently mutated oncogenes in human colorectal, breast and liver cancers; it has recently been found to be important in determining clinical benefit from HER2-targeted therapies in breast cancer. This gene has a pseudogene present in the genome, a nearly exact copy that has lost its function. Thus, when you map a sequence read that is a fragment of the PIK3CA target, there are two places in the genome to which it maps. Alignment and calling algorithms have to take the pseudogene copy into account when determining whether a sequence read came from the pseudogene or the genuine gene.
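
In practice, aligners signal this ambiguity through the mapping quality (MAPQ) field: a read that fits two genomic locations equally well is assigned a MAPQ near zero. A minimal sketch of counting only confidently placed reads follows; it assumes the pysam library is installed and an indexed BAM file is available, and the file path and coordinates shown are placeholders.

```python
import pysam  # assumes pysam is installed; requires an indexed BAM file

# Sketch: a read that maps equally well to PIK3CA and its pseudogene is
# reported by the aligner with a MAPQ near zero. One common mitigation
# is to count only confidently placed, non-duplicate reads.

MIN_MAPQ = 30  # illustrative cutoff; tune against your own validation data

def confident_depth(bam_path: str, contig: str, start: int, end: int) -> int:
    """Count reads over a region whose placement the aligner trusts."""
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        return sum(1 for read in bam.fetch(contig, start, end)
                   if read.mapping_quality >= MIN_MAPQ and not read.is_duplicate)

# e.g. depth over a PIK3CA region (path and coordinates are placeholders):
# print(confident_depth("sample.bam", "chr3", 178_936_000, 178_936_200))
```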

And then there is the calling itself: each base has a quality score attached to it, and each read has its own quality score. How many mutation-containing reads at one particular site are sufficient? What quality is sufficient, at the level of a single base or of the entire read? Is one high-quality mutant read enough? Or four, or twelve? How many are enough? These are judgment calls, with pages of parameters to set and decisions to make. This kind of work is what keeps bioinformatics scientists very busy (and well-employed, to boot).
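
A toy sketch of what such a parameter set might look like in code follows; every cutoff shown (minimum mutant reads, minimum mean base quality, minimum allele frequency) is an illustrative assumption, not a recommended setting.

```python
# Toy sketch of the thresholding decisions described above. All cutoffs
# are illustrative parameters, not recommended settings.

def make_call(alt_reads: int, total_reads: int, mean_alt_baseq: float,
              min_alt: int = 4, min_baseq: float = 25.0,
              min_vaf: float = 0.02) -> bool:
    """Decide whether the evidence at a site clears the configured bars."""
    if total_reads == 0:
        return False
    vaf = alt_reads / total_reads
    return alt_reads >= min_alt and mean_alt_baseq >= min_baseq and vaf >= min_vaf

print(make_call(alt_reads=12, total_reads=400, mean_alt_baseq=33.0))  # True
print(make_call(alt_reads=1,  total_reads=400, mean_alt_baseq=38.0))  # False
```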

Some final considerations

The clinical oncology market for solid tumor analysis operates with a nominal floor of 5% minor allele frequency, below which standard informatics pipelines will refuse to make a call. In a clinical setting, assay verification and dilution studies show that this 5% threshold is difficult to go below due to inherent limitations of the assay.
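
Simple binomial sampling illustrates why: even ignoring sequencing error, the number of mutant reads observed at a true low allele frequency varies widely with depth. A sketch, with illustrative depths and a hypothetical four-read calling requirement:

```python
from math import comb

# Sketch of why low allele frequencies are hard: at a true 2% VAF, the
# number of mutant reads sampled at a given depth follows a binomial
# distribution (ignoring sequencing error, which only makes things worse).

def prob_at_least_k(depth: int, vaf: float, k: int) -> float:
    """P(observing >= k mutant reads) under binomial sampling."""
    return sum(comb(depth, i) * vaf**i * (1 - vaf)**(depth - i)
               for i in range(k, depth + 1))

# Chance of seeing the >=4 mutant reads a caller might require, at 2% VAF:
print(f"{prob_at_least_k(200,  0.02, 4):.1%}")   # ~57% at 200x depth
print(f"{prob_at_least_k(1000, 0.02, 4):.1%}")   # >99% at 1,000x depth
```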

Pillar Biosciences, however, is able to go down to 2% from FFPE samples, and as low as 1% from high-quality samples, due to both the enrichment technology and the informatics pipeline. It’s worth giving a try, especially if you can relate to the difficulties of detecting low-frequency somatic mutations. For more information, click here.

Would you like to receive emails whenever a new blog post is published on The Pillar Post? You are invited to sign up here. You can also check out our other Pillar Posts here.