Autosomal DNA statistics
From ISOGG Wiki
Autosomal DNA statistics describe the connection between the genealogical relationship between two people and the amount of autosomal DNA which they share. Understanding this connection is critical to interpreting the results of an autosomal DNA test.
Autosomal DNA is inherited equally from both parents. The autosomal DNA inherited from more distant ancestors is randomly shuffled in a process called recombination, and the percentage of autosomal DNA coming from each ancestor is diluted with each new generation.
When interpreting autosomal DNA statistics, one must be careful to distinguish between the distribution of shared DNA for given relationships and the distribution of relationships for given amounts of shared DNA.
For example, known second cousins on average share 212.5 centiMorgans (cMs), but in extreme cases can actually share as little as 47cMs or as much as 760cMs. Conversely, the relationship between pairs of individuals sharing 212.5cMs has been found to be anywhere between aunt-or-uncle/niece-or-nephew and third cousin once removed.
Autosomal DNA tests for finding cousins and verifying relationships for genetic genealogy purposes are offered by 23andMe, AncestryDNA and Family Tree DNA (the Family Finder test). For comparisons of the different services see Tim Janzen's autosomal DNA testing comparison chart.
- 1 Distribution of shared DNA for given relationships
- 2 Ranges of sharing percentage
- 3 Shared SNPs
- 4 Identical by descent segments
- 5 Distribution of genealogical relationships for given amounts of shared DNA
- 6 Blog posts
- 7 Resources
- 8 Scientific papers
- 9 References
- 10 See also
Distribution of shared DNA for given relationships
There are two simple mathematical methods of calculating the percentages of autosomal DNA shared by two individuals. Both methods give the same results except in the cases of parent/child comparisons, full siblings, double cousins, or any two individuals who are each related to the other through both parents.
The autosomal DNA of two related individuals will be half-identical in regions where each has inherited the same DNA from one parent, and ultimately from one common ancestor. In the cases of siblings and double cousins, their autosomal DNA will be fully identical in regions where each has inherited the same DNA from both parents or from two more distant common ancestors respectively. Full siblings are half-identical on regions where each has inherited the same DNA from exactly one parent and fully identical on regions where each has inherited the same DNA from both parents.
The first method of calculating percentages (Method I, displayed by 23andMe) expresses the aggregate length of the shared segments (i.e. the aggregate length of the half-identical regions, where there is one shared segment, plus twice the aggregate length of the fully identical regions, where there are two shared segments, one paternal and one maternal) as a percentage of the aggregate length of the paternal and maternal autosomes. Using this method, full siblings (excluding identical twins), who are expected to be half-identical on 50% of their autosomal DNA and fully identical on a further 25% of their autosomal DNA, will on average appear to have 50% shared.
The second method of calculating percentages (Method II, to which those relying on FTDNA or GEDmatch must resort) expresses the aggregate length of the half-identical (or better) regions as a percentage of the aggregate length of both sets of autosomes (paternal and maternal). The maximum value that the numerator in this percentage can take is the length of one set of autosomes (say the paternal); the denominator is the length of two sets of autosomes (maternal plus paternal). Thus, the percentages calculated by this method cannot exceed 50%, which is the value that it takes for a parent/child comparison (half-identical at all locations) or a comparison between identical twins (fully identical at all locations). Using Method II, full siblings (other than identical twins) will on average appear to have only 37.5% shared. Whenever there are fully identical regions, the calculated percentages will be smaller than for Method I, as half-identical and fully identical regions cannot be distinguished from the available data and must be given equal weight in the calculation.
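The difference between the two methods can be illustrated with a short Python sketch. The function names, and the idea of passing the half-identical and fully identical fractions as arguments, are illustrative assumptions rather than anything a testing company exposes; the input fractions are the expected values quoted above, expressed as fractions of one set of autosomes.

```python
def method_one(half_identical, fully_identical):
    """Method I: shared segments as a fraction of both sets of autosomes.

    Both arguments are fractions of ONE set of autosomes (0.0 to 1.0).
    A half-identical region carries one shared segment and a fully
    identical region carries two, hence the factor of 2.
    """
    return (half_identical + 2 * fully_identical) / 2


def method_two(half_identical, fully_identical):
    """Method II: half-identical (or better) regions, all weighted equally."""
    return (half_identical + fully_identical) / 2


# (half-identical fraction, fully identical fraction) for each relationship
relationships = {
    "parent/child": (1.0, 0.0),
    "identical twins": (0.0, 1.0),
    "full siblings": (0.5, 0.25),
}

for name, (hir, fir) in relationships.items():
    print(f"{name}: Method I {method_one(hir, fir):.1%}, "
          f"Method II {method_two(hir, fir):.1%}")
```

With these inputs the two methods agree for a parent/child comparison and differ for identical twins and full siblings, which is the behaviour described in the two paragraphs above.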
The first column in the table below shows the average percentages for different relationships and methods of calculation. The calculations assume that every child gets 50% from its mother and 50% from its father and in turn 25% from each of its four grandparents. The actual percentages vary from the average in individual cases. For example, a person might share 27% of his DNA with one nephew and only 23% with another. Because of the random way that autosomal DNA is inherited, third, fourth and more distant cousins will not necessarily have any detectable half-identical regions. According to Family Tree DNA's figures there is a 90% chance that third cousins will share enough DNA for the relationship to be detected, but there is only a 50% chance that you will share enough DNA with a fourth cousin for the relationship to be identified.
The degree of sharing is also displayed by the DNA companies in units of genetic distance known as centiMorgans (cMs), although in practice the total number of shared centiMorgans is less significant than the number and lengths of individual shared segments. The second column in the table below shows the aggregate lengths in cM of the half-identical (or better) regions shared on average by various pairs of relatives. It assumes that the aggregate length of each set of autosomal chromosomes is 3400cM, and thus that each individual inherits 6800cM of autosomal DNA, 3400cM from each parent. Different DNA companies use different conventions, so that the actual cM figures, as displayed by 23andMe's Family Inheritance: Advanced, FTDNA and GEDmatch, may be slightly different from these round numbers, even before allowing for random variation around the averages in individual cases.
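As a rough illustration of how the average figures for single cousin relationships arise, the following Python sketch applies the halving assumption described above (50% from each parent at every generation) and converts the result to cM using the 3400cM figure. The function names and parameters are hypothetical, and the sketch covers only ordinary full and half cousin relationships, not siblings or double relationships.

```python
AUTOSOME_SET_CM = 3400.0  # assumed length of one set of autosomes, as in the text


def expected_cousin_share(degree, removed=0, half=False):
    """Expected shared fraction for n-th cousins, `removed` times removed.

    The path through one common ancestor passes through
    2 * (degree + 1) + removed parent-to-child transmissions, each of
    which halves the expected sharing. Full cousins have two common
    ancestors (a couple); half cousins have only one.
    """
    transmissions = 2 * (degree + 1) + removed
    common_ancestors = 1 if half else 2
    return common_ancestors * 0.5 ** transmissions


def expected_cm(fraction):
    """Convert an expected fraction into cM over both sets of autosomes."""
    return fraction * 2 * AUTOSOME_SET_CM


examples = [
    ("first cousins", 1, 0, False),
    ("half first cousins", 1, 0, True),
    ("second cousins", 2, 0, False),
    ("third cousins once removed", 3, 1, False),
]
for label, degree, removed, half in examples:
    share = expected_cousin_share(degree, removed, half)
    print(f"{label}: {share:.4%} of autosomal DNA, {expected_cm(share):.2f} cM")
```

These are averages only; as noted above, the figures for individual pairs vary around them.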
The reason for the different results from Method I and Method II in the case of siblings and double cousins is that the cM lengths displayed by FTDNA and in the free GEDmatch utility (and, indeed, 23andMe's own Family Inheritance: Advanced) do not distinguish between half-identical and fully identical regions. The best place to see the distinction between half-identical regions (HIRs) and fully identical regions (FIRs) is in the optional graphical output of the one-to-one comparisons at GEDmatch.com, where FIRs are displayed in green and HIRs are displayed in yellow. It is also possible to see the fully identical regions at 23andMe by using the Family Traits chromosome browser (accessed via the Family and Friends menu).
When using Family Finder data, the percentages based on Method II can be calculated from the cM lengths by dividing the displayed Shared cM by 68.
Note that the FTDNA figures exclude the X-chromosome cMs but the 23andMe figures include them. Males have one X-chromosome and females have two X-chromosomes. If you want to include the X-chromosome in the calculations, so that the atDNA is combined with the X-chromosome, divide by 68.81065 instead of 68. Note that the expected shared percentages of X-DNA depend not only on the genealogical relationship between two people, but also on the numbers of males and females in the two paths to their common ancestor.
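The conversion from shared cM to a Method II percentage is simple arithmetic, sketched below in Python. The function name and parameter are hypothetical; the two totals are 100 times the divisors given above (68 and 68.81065).

```python
AUTOSOMAL_TOTAL_CM = 6800.0       # 100 x 68, autosomes only
WITH_X_TOTAL_CM = 6881.065        # 100 x 68.81065, autosomes plus X-chromosome


def shared_percent(shared_cm, total_cm=AUTOSOMAL_TOTAL_CM):
    """Express shared cM as a percentage of the chosen total length."""
    return 100.0 * shared_cm / total_cm


# Family Finder style figures (X-chromosome excluded)
print(shared_percent(1700.0))
print(shared_percent(212.5))

# Same shared cM, but measured against the total that includes the X-chromosome
print(shared_percent(212.5, WITH_X_TOTAL_CM))
```

Dividing by the larger total gives a slightly smaller percentage for the same number of shared cM, so percentages should only be compared when they use the same denominator.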
23andMe include the X-chromosome in their calculations, so their cM figures will be higher than those provided by FTDNA. 23andMe made adjustments to the cM count in June 2013 so the number of cMs will vary slightly depending on when the test was taken.
- For females using 23andMe data prior to June 2013, there were 7494.8cMs when combining the paternal and maternal autosomal DNA and the two X-chromosomes per Family Inheritance: Advanced.
- For females using 23andMe data after June 2013 there were 7438.6cMs when combining the paternal and maternal autosomal DNA and the two X-chromosomes per Family Inheritance: Advanced.
- There are 7074.6 autosomal cMs per 23andMe.
- For males using 23andMe data there are 7256.8 cMs when combining the atDNA with the single X-chromosome.
Note that AncestryDNA do not provide information on the lengths of half-identical (or better) regions in either centiMorgans or percentages. However, AncestryDNA customers can upload their raw data to the free GEDmatch utility in order to extract the necessary cM data for making comparisons and to check the relationship predictions. David Pike's tools can also be used.
|% shared||cM half-identical (or better)||Relationship||Notes|
|100% (Method I)/50% (Method II)||3400.00||Identical twins (monozygotic twins)||Fully identical everywhere.|
|50%||3400.00||Parent/child||Half-identical everywhere.|
|50% (Method I)/37.5% (Method II)||2550.00||Full siblings||Half-identical on 50%/1700cM and fully identical on a further 25%/850cM.|
|25%||1700.00||Grandparent/grandchild, aunt-or-uncle/niece-or-nephew, half-siblings|
|25% (Method I)/23.4375% (Method II)||1593.75||Double first cousins||Half-identical on 21.875%/1487.5cM and fully identical on a further 1.5625%/106.25cM|
|12.5%||850.00||Greatgrandparent/greatgrandchild, first cousins, greatuncle-or-aunt/greatnephew-or-niece, half-uncle-or-aunt/half-nephew-or-niece|
|6.25%||425.00||First cousins once removed, half first cousins|
|3.125%||212.50||Second cousins, first cousins twice removed|
|1.563%||106.25||Second cousins once removed, half second cousins|
|0.781%||53.13||Third cousins, second cousins twice removed|
|0.391%||26.56||Third cousins once removed|
|0.0977%||6.64||Fourth cousins once removed|
|0.0244%||1.66||Fifth cousins once removed|
|0.0061%||0.42||Sixth cousins once removed|
|0.001525%||0.10||Seventh cousins once removed|
Notes to Table
- There is no variation between families in the parent/child or identical twins shared cM figures; beyond these immediate relationships, recombination results in random variation around the average figures above from one pair of individuals to another.
- When a grandchild is compared to a grandparent, the shared cM with the other grandparent on the same side is easily inferred. The grandchild gets all 3400cM of, say, his paternal autosomes from his father. If it is seen that 1600cM of this came from the paternal grandfather, then the other 1800cM must have come from the paternal grandmother. The initial estimate of 1700cM shared by grandchild and paternal grandmother can thus be updated to 1800cM when it has been ascertained that grandchild and paternal grandfather share only a below average 1600cM.
- When the subjects of the comparison descend from identical twin children of their most recent common ancestral couple, then the figures in the above table should be doubled.
- The expected % shared for a half-relationship will always be exactly half of the expected % shared for the corresponding full relationship.
- A similar method to that used for full siblings and for double first cousins can be used to compute expected shared percentages for any two subjects of comparison who are doubly related. However, the expected % shared for a double relationship can be slightly less than the sum of the expected % shared for the appropriate single relationships.
- If Jack is related to both of Jill's parents, then Method I and Method II will give slightly different figures, as double cousins of this type are expected to be fully identical in some regions.
- If Jill is a more remote descendant of spouses who are both related to Jack, then Jill will clearly have inherited at most one of the two segments in regions where the child of those spouses was fully identical to Jack. This reduces Jack and Jill's expected % shared slightly from the ballpark figure obtained by adding the expected % shared for the two relationships.
- For example, double second cousins, where the double relationship arises because at least one is related on both the paternal side and the maternal side to the other, are expected to share 3.125% (1/32) on each side, or 6.25% (1/16) in total, using Method I. Using Method II, a small adjustment must be made to allow for regions where they are fully identical (1/1024 or approximately 0.098%), so that they are expected to be half-identical or better on 63/1024 or approximately 6.152%.
- On the other hand, double second cousins who are children of double first cousins are expected to be half-identical on a quarter of the approximately 23.438% on which their parents are half-identical or better, in other words on approximately 5.859%.
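The grandparent inference described in the notes above amounts to a subtraction, which can be sketched in Python as follows. The function name is hypothetical, and the 3400cM length of one parental set of autosomes is the same assumption used in the table.

```python
PARENTAL_SET_CM = 3400.0  # assumed length of one set of autosomes, as in the text


def other_grandparent_cm(shared_with_one_cm, parental_set_cm=PARENTAL_SET_CM):
    """Infer the cM shared with the other grandparent on the same side.

    Everything a grandchild inherits through one parent comes from that
    parent's two parents, so the two grandparental contributions on the
    same side add up to the full parental set.
    """
    if not 0 <= shared_with_one_cm <= parental_set_cm:
        raise ValueError("shared cM must lie between 0 and the parental set length")
    return parental_set_cm - shared_with_one_cm


# Example from the notes: 1600cM shared with the paternal grandfather
print(other_grandparent_cm(1600.0))
```

The same subtraction works on either the paternal or the maternal side, provided the observed figure and the assumed total use the same cM convention.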
The chart below (courtesy Dimario, Wikimedia Commons) shows the average amount of autosomal DNA inherited by all close relations up to the third cousin level.
Ranges of sharing percentage
Figures from 23andMe's Relative Finder:
- Parent/child: 47.54% (for father/son pairs, who do not share the X-chromosome) to ~50%
- 1st cousins: 7.31%-13.8%
- 1st cousins once removed: 3.3%-8.51%
- 2nd cousins: 2.85%-5.04%
- 2nd cousins once removed: 0.57%-2.54%
- 3rd cousins: ca. 0.3%-2.0%
- 3rd cousins once removed: 0.11%-1.32%
- 4th and more distant cousins: 0.07%-0.5%
Shared SNPs
Figures from 23andMe Compare Genes function (from Tim Janzen's data):
- Parent-child pairs share between 83.94% and 84.20% of SNPs (50% of DNA in common)
- Siblings share between 83.81% and 87.47% of SNPs (50% of DNA in common)
- Uncle/aunt-niece/nephew pairs share between 78.48% and 79.57% of SNPs (25% of DNA in common)
- Grandparent-grandchild pairs share between 77.96% and 80.59% of SNPs (25% of DNA in common)
- First cousins and great uncle/great aunt-grandniece/grandnephew pairs share between 75.78% and 77.03% of SNPs (12.5% of DNA in common)
- First cousins once removed share ca 75.5% of SNPs (6.25% of DNA in common)
- Second cousins and first cousins twice removed share ca 75% of SNPs (3.125% of DNA in common)
- Unrelated people of European descent share 73-74.6% of SNPs
Identical by descent segments
It is important to remember that we do not inherit DNA segments from every genealogical ancestor. At ten generations we have approximately 1024 ancestors although there is generally some overlap as a result of pedigree collapse. While all these ancestors can potentially be documented in our genealogical tree we only inherit segments of DNA from a small subset of these ancestors. Luke Jostins found that "The probability of having DNA from all of your genealogical ancestors at a particular generation becomes vanishingly small very rapidly; there is a 99.6% chance that you will have DNA from all of your 16 great-great grandparents, only a 54% [chance] of sharing DNA with all 32 of your G-G-G grandparents, and a 0.01% chance for your 64 G-G-G-G grandparents. You only have to go back 5 generations for genealogical relatives to start dropping off your DNA tree."
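The arithmetic behind the dilution can be sketched in Python. This is only an illustration of the averages: it ignores pedigree collapse, uses the 6800cM total assumed earlier in this article, and the function names are hypothetical. It does not reproduce the probabilities quoted from Luke Jostins, which depend on how recombination breaks the genome into segments.

```python
TOTAL_AUTOSOMAL_CM = 6800.0  # assumed total for both sets of autosomes


def ancestors_at_generation(generations_back):
    """Number of genealogical ancestors, ignoring pedigree collapse."""
    return 2 ** generations_back


def average_cm_per_ancestor(generations_back):
    """Average autosomal cM inherited from each ancestor at that generation."""
    return TOTAL_AUTOSOMAL_CM / ancestors_at_generation(generations_back)


for generations in (2, 4, 5, 6, 10):
    print(f"{generations} generations back: "
          f"{ancestors_at_generation(generations)} ancestors, "
          f"{average_cm_per_ancestor(generations):.2f} cM each on average")
```

Because DNA is passed on in a limited number of segments rather than in evenly divided portions, an average this small is consistent with many ancestors contributing nothing at all.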
In addition it is important to note that not all of the matching segments are true matches (identical by descent). Some segments, especially the smaller segments, will be false positive matches (identical by state).
Distribution of genealogical relationships for given amounts of shared DNA
In order to help people who have taken an autosomal DNA test gain greater insight into the genealogical relationships implied by the resultant data a number of genetic genealogists have been collecting statistics on the amount of DNA shared for known relationships.
Blaine Bettinger has been collecting statistics from the genetic genealogy community on the number of shared centiMorgans for known genealogical relationships as part of his shared cM project. The chart below is a visualisation of the range of shared centiMorgans for close relations up to the third cousin level based on data supplied to the project for over 6000 known relationships. The chart is made available under a Creative Commons Licence. You are free to share and use the information for non-commercial purposes, as long as you give proper attribution and release anything you create under the same licence. A high-quality version of the chart can be downloaded from this link. Data is still being collected for the project and you can add your own statistics using this form on GoogleDocs.
Tim Janzen has created three charts that provide statistical information in various categories. The charts provide statistics on close relatives, distant endogamous relatives and distant non-endogamous relatives. The charts were originally designed for use with 23andMe data but now also incorporate data from FTDNA's Family Finder test. The charts are organized by the degree of relationship, with the most closely related people (parents and children, full siblings) listed at the top and more distant cousins listed at the bottom. The statistics are based on information from real people who have been tested by 23andMe and Family Tree DNA and who have a known genealogical relationship to someone else who has also been tested by the same company. The charts also include information on the median and the average number of shared cMs for people who are related to each other from the first cousin once removed level of relationship to the 5th cousin level of relationship. The charts can be downloaded from the Anabaptist Genetic Genealogy website.
Tim Janzen has also compiled a chart showing the probability of a given genealogical relationship for each cM threshold, in one-cM increments from 6 cMs up to 200 cMs. The chart may be downloaded as an Excel file from this link. This chart applies to non-endogamous populations. When using the chart to predict relationship from Family Finder data you will need to remove the data for all segments under 4 or 5 cMs.
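Removing the small segments before totalling can be sketched in Python as follows. The function name is hypothetical and the segment lengths are made-up example values, not real match data; the threshold is left as a parameter because the text allows either 4 or 5 cMs.

```python
def total_shared_cm(segment_lengths_cm, min_segment_cm=5.0):
    """Sum the shared segments, ignoring any shorter than the threshold."""
    return sum(length for length in segment_lengths_cm
               if length >= min_segment_cm)


# Hypothetical segment lengths (cM) for one match
segments = [40.0, 12.5, 4.5, 3.0, 1.5]

print(total_shared_cm(segments))                      # segments under 5 cMs removed
print(total_shared_cm(segments, min_segment_cm=4.0))  # segments under 4 cMs removed
print(sum(segments))                                  # unfiltered total, for comparison
```

The filtered total, rather than the unfiltered one, is the figure to look up in a chart built on that threshold.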
An unidentified author has also provided a spreadsheet on DNA Inheritance Statistics to which anyone can add their data. The spreadsheet can be found here.
Blog posts
- An analysis of fourth cousins and other near distant relationships by Jim Owston, Owston/Ouston One-Name Study blog, 10 August 2015.
- How many genomic blocks do you share with a cousin? by Graham Coop, The Coop Lab blog, 2 December 2013.
- How many genetic ancestors do I have? by Graham Coop, The Coop Lab blog, 11 November 2013.
- How much of your genome do you inherit from a particular ancestor? by Graham Coop, The Coop Lab blog, 4 November 2013.
- How much of your genome do you inherit from a particular grandparent? by Graham Coop, The Coop Lab blog, 20 October 2013.
- Widen the net by Judy Russell, The Legal Genealogist, 7 April 2013. A cautionary tale about third cousin matches.
- Genomic variation in sharing between siblings by Graham Coop, The Coop Lab blog, 26 January 2014.
- DNA portraits: second cousins by Jim Owston, Lineal Arboretum blog, 10 April 2012.
- Relatedness Qs and As. Ask a Geneticist, Understanding Genetics, Stanford at the Tech, 2 November 2011.
- The DNA numbers game by Lindsay Greenawalt, Confessions of a Cryokid blog, 5 September 2011.
- Genetic genealogy and the single segment by Steve Mount, On Genetics blog, 19 February 2011.
- Known Relative Studies by CeCe Moore, "Your Genetic Genealogist" blog, 26 September 2010 (this is a series).
- How many ancestors share my DNA? by Luke Jostins, Genetic Inference blog, 11 November 2009.
- Q&A: Everyone has two family trees - a genealogical tree and a genetic tree by Blaine Bettinger, The Genetic Genealogist, 10 November 2009.
- Unequal genetic similarity between one's mother and father by Maciamo. Eupedia forum, 8 August 2009.
Resources
Charts and tools
- MorganSense Relationships Prediction Utility from Gliesian, LLC.
- Cousin calculator Free download
- Cousin relationship calculator from Ancestor Search
- Genealogy relationship chart from About.com
- Ancestor chart from Hope Carnicle showing all the percentages up to the ninth cousin level
- Ancestor chart from Hope Carnicle. The same ancestor chart from Hope Carnicle in a diamond shape
Resources from FTDNA and 23andMe
- What is the probability that my relative and I share enough DNA to be detected by Family Finder? Family Tree DNA Learning Center.
- The probability of detecting different types of cousins 23andMe FAQ
- What's the average % DNA shared for different types of cousins 23andMe FAQ.
Scientific papers
- Hill, WG & Weir, BS (2011). Variation in actual relationship as a consequence of Mendelian sampling and linkage. Genetics Research, vol 93, no. 1, pp. 47-64. (See in particular Figure 5 which shows the distribution of actual genome sharing for different degrees of pedigree relationship.)
References
- In genetic genealogy, the verb 'test' is used loosely to describe the process of submitting a DNA sample and receiving results such as lists of possible relatives. This meaning should not be confused with the precise use of the same word by statisticians referring to the 'testing' of a hypothesis, which is either accepted or rejected based on the statistics observed.
- Tiny differences between identical twins can now be detected by next generation sequencing. See: Weber-Lehman et al 2014. Finding the needle in the haystack: Differentiating "identical" twins in paternity testing and forensics by ultra-deep next generation sequencing. Forensic Science International: Genetics; 9: 42-46. See also the editorial by Bruce Budowle in Investigative Genetics: Molecular genetic investigative leads to differentiate monozygotic twins.
- Jostins L. How many ancestors share our DNA? Genetic inference blog. 11 November 2009.