DNA Copy Number and its Effects on Log2 Values and Allelic Ratios at Different Tumor Purities

Post date: Jan 27, 2018 6:33:35 PM

This is a post that I've been meaning to put together for many years on what log2 values in DNA copy number analysis actually mean. I'll start by admitting my own error as a post-doc, when I incorrectly assumed that a log2 of -1 = 1 copy state, 0 = 2 copy state, and 1 = 3 copy state. Hopefully, if you are reading this post, you already know that the 3 copy state actually has a log2 of 0.58 and that a log2 of 1.0 actually represents the 4 copy state. Over the years I've made quick tables in Excel to convince myself, and show others, what log2 values you expect in theory for each copy number state of a tumor population compared to a normal diploid population. I've also done it over and over again to establish the expected log2 value when an event is in 100% of the tumor cells but the purity of tumor cells in the tested population is not 100%. So I've finally broken down and put it all together once and for all, so I never have to do it in Excel again. Thus this post is mostly for my own sanity and time management, but hopefully it's useful to someone else out there as well.

Outlining the Problem

Whenever we look at tumor samples and try to assess their copy number state, we often compare the tumor sample to a normal sample. This might use any number of methods, from CGH or SNP microarrays to exome or genome sequencing assays, but for the most part we have a population of tumor cells and a population of normal cells to compare. The two most common approaches are to compare the signal or counts from the tumor to the normal and then log2 transform the ratio (i.e. value = log2 ( Tumor / Normal ) ), or to look at the B, or alternate, allele frequency as an absolute allelic percentage. The absolute allele frequency is calculated by first identifying the heterozygous positions in the normal and then subtracting the normal heterozygous frequency of 0.5 (50%) from the observed frequency at each matching position in the tumor (i.e. absolute allele frequency = abs ( Observed B allele frequency - 0.5 ) ). The absolute value corrects for the fact that the reference allele is distributed roughly equally between the maternal and paternal haplotypes. Because of this, at copy number states without an equal mixture of A and B alleles, the raw B allele frequencies observed across a chromosome fall into two different bands (Figure 1). Taking the absolute deviation from 0.5 folds those two bands into a single value.
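As a minimal sketch, the two calculations described above can be written as small Python functions. The function names and the example inputs are my own illustrative choices, not part of any particular pipeline.

```python
import math


def log2_ratio(tumor_signal, normal_signal):
    """value = log2(Tumor / Normal); both signals must be greater than 0."""
    return math.log2(tumor_signal / normal_signal)


def absolute_allele_frequency(observed_baf):
    """abs(Observed B allele frequency - 0.5) at a position that is
    heterozygous in the matched normal."""
    return abs(observed_baf - 0.5)


# Hypothetical 3 copy region (e.g. ABB): the tumor has 1.5x the normal signal.
print(round(log2_ratio(150, 100), 2))  # 0.58

# In the same region the raw B allele frequency is 2/3 at positions where B sits
# on the gained haplotype and 1/3 where it sits on the other one. Both fold to
# the same absolute allele frequency.
print(round(absolute_allele_frequency(2 / 3), 3))  # 0.167
print(round(absolute_allele_frequency(1 / 3), 3))  # 0.167
```

The last two lines show why the absolute value is useful: the two raw bands at 1/3 and 2/3 collapse to one number.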

Figure 1 - Example Copy Number States and Associated Allele Frequencies

Representative cells are shown for a normal diploid cell and tumor cells with copy number states from 0 to 4. For simplicity the two possible homozygous allele states for the 4 copy state are not shown. In all cases the copy number of a region or allele frequency is represented by the vertical blue or purple bars. Cells with loss of heterozygosity (LOH) are noted, as is the unique situation of copy-neutral LOH that can be observed in the 2 copy state. Personally I would never call this uniparental disomy (UPD), as it is a somatic process.

Basics of Raw Copy Number Estimation

To establish the log2 copy number value for each potential state, my standard practice is to build a table assuming there are 100 cells in the tumor population and each chromosome counts for 1 signal, or count, value. Then for each chromosome lost or gained I alter the tumor signal/count proportionally. These theoretical tables and the resulting distributions are outlined in Table 1 and Figure 2.

Table 1 - Theoretical Log2 Values for Distinct Copy Number States

* The zero copy number state truly has a signal/count of 0, but since that would result in a log2 value of negative infinity, for presentation purposes I show a single signal/count value.
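The 100-cell counting approach behind Table 1 can be sketched in Python as follows. The function name and parameters are hypothetical. The `purity` argument is my own extension, assuming the non-tumor cells in the sample are normal diploid cells. The zero copy state at 100% purity is returned as negative infinity rather than a substituted count, so the caller can decide how to display it.

```python
import math


def expected_log2(copy_number, purity=1.0, n_cells=100, normal_copies=2):
    """Theoretical log2(Tumor / Normal) for a region at `copy_number` in every
    tumor cell, with each chromosome copy contributing 1 signal/count.

    Assumption: the tested population has `n_cells` cells, a fraction `purity`
    of which are tumor cells; the rest are normal diploid cells.
    """
    tumor_cells = n_cells * purity
    normal_cells = n_cells - tumor_cells
    tumor_signal = tumor_cells * copy_number + normal_cells * normal_copies
    normal_signal = n_cells * normal_copies
    if tumor_signal == 0:
        # log2(0) is undefined; report it as negative infinity
        return float("-inf")
    return math.log2(tumor_signal / normal_signal)


# Copy number states 0-4 at 100% and 50% tumor purity
for cn in range(5):
    print(cn, expected_log2(cn, purity=1.0), expected_log2(cn, purity=0.5))
```

At 100% purity this gives -1 for the 1 copy state, about 0.58 for the 3 copy state, and 1 for the 4 copy state. At lower purity every value is pulled toward 0.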

Figure 2 - Theoretical Log2 Values for Distinct Copy Number States in Pure Tumor Populations Compared to a Normal Diploid Control

Horizontal dotted dark mustard lines define our established mean ± 5 SD cutoff used in the MMRF CoMMpass study, and the vertical light green dotted line indicates the expected value for a normal diploid tumor population. In both plots the homozygous (bi-allelic) deletion, the 0 copy number state, which actually has a log2 value of negative infinity, is hard coded to -5. This is within the functional range of -4 to -6 that I've personally encountered over the years with aCGH or NGS. A) Theoretical log2 values for copy number states 0-50. B) Theoretical log2 values for copy number states 0-10 with the actual expected log2 value noted.

Basics of B Allele Frequency Assessment

When used alone, B allele frequency has limited utility, as the observed B allele ratio and absolute frequency can be associated with multiple copy number states (Table 2). But when integrated with the known copy number state it can be very valuable in determining the allelic state, and it can be leveraged to calculate the purity of a sample. To establish the absolute B allele frequency I build a table assuming there are 100 tumor cells in the population and then assign 1 count to each allele present in the theoretical scenario (Table 2). In the most simplistic scenario you can use the absolute allele frequency to create an expected distribution of the allele frequency for each copy number state, assuming the copy number is only changing for one allele while the other allele stays constant at the normal state of 1 copy (Figure 3). As illustrated in Table 2, this is a simplification, but it is also not uncommon at copy number states 1-4; in my personal experience the most common alleles observed are AO, OB, AB, AA, BB, AAB, ABB, AABB, AAAB, and ABBB.
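The allele-counting idea above can be sketched in Python like this. The function names are hypothetical. Genotypes are written as strings such as "AAB", with "O" marking a lost allele as in AO and OB. The `purity` argument is my own addition and assumes the contaminating normal cells are heterozygous AB.

```python
def b_allele_frequency(genotype, purity=1.0):
    """Expected B allele frequency, assigning 1 count to each allele present.

    Assumption: a fraction `purity` of cells carry `genotype` and the
    remaining cells are normal heterozygous AB cells.
    """
    n_a = genotype.count("A")
    n_b = genotype.count("B")
    b_counts = purity * n_b + (1 - purity) * 1
    total_counts = purity * (n_a + n_b) + (1 - purity) * 2
    if total_counts == 0:
        # Homozygous deletion in a pure tumor: no alleles left to measure
        return float("nan")
    return b_counts / total_counts


def absolute_b_allele_frequency(genotype, purity=1.0):
    """abs(B allele frequency - 0.5) for the given genotype and purity."""
    return abs(b_allele_frequency(genotype, purity) - 0.5)


# The common allelic states listed above, at 100% tumor purity
for genotype in ["AO", "OB", "AB", "AA", "BB", "AAB", "ABB", "AABB", "AAAB", "ABBB"]:
    print(genotype, round(absolute_b_allele_frequency(genotype), 3))
```

Running this shows why B allele frequency alone is ambiguous: AO, OB, AA, and BB all give 0.5, and AB and AABB both give 0, even though they span three different copy number states.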

Table 2 - Absolute B Allele Frequencies for Distinct Copy Number States