Identical by descent

From ISOGG Wiki

Identical by descent or identity by descent (IBD) is a term used in genetic genealogy to describe a matching segment of DNA shared by two or more people that has been inherited from a recent common ancestor without any intervening recombination. The segments are considered to match if all the alleles on a paternal or maternal chromosome are identical (barring rare mutations and genotyping errors) and if the minimum threshold conditions set by the testing company have been met. Being identical by descent is contrasted to being identical by state (IBS).

Everyone has two copies of each chromosome - one chromosome inherited from their father and one chromosome inherited from their mother. Matching segments can be on half-identical regions (HIRs) (matches on the paternal or maternal chromosome) or fully identical regions (FIRs) (matches on both the paternal and maternal chromosome). FIRs are generally only seen in full siblings, but are sometimes found in close cousins who share descent from an endogamous (intermarrying) population.

IBD segments can be measured in centiMorgans (a unit of genetic distance) or in megabases (a unit of physical distance). The three major autosomal DNA testing companies (23andMe, AncestryDNA and Family Tree DNA) all now report segment sizes in centiMorgans. AncestryDNA originally used megabases for their matching algorithms but converted to centiMorgans in about January 2014. Both 23andMe and Family Tree DNA provide information on the matching segments which can be downloaded into a spreadsheet, and they also provide chromosome browsers which allow customers to see a visual representation of the matching half-identical DNA segments. The chromosome browsers show only one chromosome from each pair of chromosomes and are unable to distinguish between the maternally inherited and paternally inherited chromosomes. AncestryDNA do not currently provide a chromosome browser and do not provide access to the underlying matching segment data.

The segment boundaries are somewhat fuzzy. 23andMe round the segment length to the nearest tenth of a centiMorgan and round the segment start and end co-ordinates to the nearest million base pairs to reflect the uncertainty in the exact locations of the segment boundaries. Family Tree DNA round the start and stop positions to the nearest complete block of 100 SNPs on each end of the matching segment.

The origin of IBD segments is depicted via a pedigree of 12 individuals. Each box (male) and circle (female) represents an individual with two homologous chromosomes as bars. The top row shows three couples with their chromosomes colored differently. Due to crossing over, offspring inherit recombinant chromosomes of their parents. The first cousins in the bottom row share one IBD segment (borders marked by grey lines). Both have inherited this IBD segment from the same individual, namely their grandfather (orange colored chromosome in the top row). Diagram courtesy of Gklambauer, Wikimedia Commons.

Identity by descent can be considered on various timescales. According to population genetics theory all individuals have common ancestry in the distant past, and we all have short, old IBD segments in common. For the purposes of genetic genealogy the focus is on detecting large IBD segments within a genealogical timeframe (effectively within the last ten generations) where there is a possibility of identifying the common ancestor through documentary records. Any given pair of individuals is related through many common ancestors, though many of these relationships will be too distant to result in detectable IBD segments. If the two individuals have ancestors from the same geographical region they might have many recent common ancestors, but many of the relationships will not result in IBD sharing, and there might only be one or two segments inherited from just a few of their many common ancestors. In a study of a European subset of the Population Reference Sample (POPRES) dataset it was estimated that for the most part IBD blocks longer than 4 cM come from 500 to 1,500 years ago, and blocks longer than 10 cM are within the last 500 years.[1]

In general it will be difficult to find the genealogical connection with the majority of your matches under about 15 cMs. Most people are unable to trace all of their ancestral lines back ten generations or so, and the common ancestral couple cannot therefore be identified. Even if a shared ancestral couple can be identified, without tracing all the other ancestral lines you cannot eliminate the possibility of shared ancestry on other as yet undocumented lines. Many matches under 15 cMs will in any case share ancestry more than ten generations ago and will be mostly beyond the reach of genealogical records.[2]

The following table provided by Tim Janzen can be used as a guideline when assessing your matches. These percentages will be different for endogamous populations.

Length of shared segment     Likelihood you and your match share a common ancestor within 6 generations
More than 30 centiMorgans    90%
20-30 centiMorgans           50%
12-20 centiMorgans           20%
6-12 centiMorgans            5%
6 centiMorgans or less       <1%
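
The guideline bands above can be turned into a simple lookup. The following is a minimal Python sketch, not part of Tim Janzen's material: the names JANZEN_BANDS and janzen_likelihood are hypothetical, and the handling of segments that fall exactly on a band edge (12, 20 or 30 cM) is an assumption, because the table does not say which band such segments belong to.

```python
# Guideline bands from the table above: (exclusive lower bound in cM, likelihood).
# Treating each band as "more than the lower bound" is an assumption; the table
# does not say which band a segment of exactly 12, 20 or 30 cM belongs to.
JANZEN_BANDS = [
    (30.0, "90%"),
    (20.0, "50%"),
    (12.0, "20%"),
    (6.0, "5%"),
]


def janzen_likelihood(segment_cm: float) -> str:
    """Return the guideline likelihood of a common ancestor within 6 generations."""
    if segment_cm < 0:
        raise ValueError("segment length cannot be negative")
    for lower_bound, likelihood in JANZEN_BANDS:
        if segment_cm > lower_bound:
            return likelihood
    return "<1%"  # 6 cM or less
```

For example, janzen_likelihood(25) falls in the 20-30 cM band. As noted above, these figures will be different for endogamous populations.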

False positive matches

The companies' matching algorithms do not treat the paternal and maternal chromosomes separately. Consequently consecutive SNP results for a short segment of DNA may appear to be half-identical in two individuals when in actuality the DNA sequences are not identical because the SNPs match on opposing chromosomes or because of errors in the matching algorithms. False matches can be the result of pseudosegments (matching alleles zig-zagging backwards and forwards between the maternal side and the paternal side), compound segments and fuzzy boundaries.[3][4] For a good illustration and explanation of a pseudosegment (also known as a spurious segment, an erroneous segment or a phantom segment) see Don Worth's diagram.

A simple check on the false positive match rate can be made by looking at the percentage of matches received by a child which do not appear in the match list of either parent. False positive matching rates of 19.6%[5] and 23%[6] have been reported by genetic genealogists using FTDNA Family Finder data. It is, therefore, important to remember that if only one parent has been tested, one should not automatically assume that all the remaining matches can be assigned to the other parent. A significant proportion will be non-IBD.
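
The check described above amounts to a set difference. The following Python sketch assumes that each match list has already been reduced to a collection of unique match identifiers (for example kit numbers or names); the function name and the input format are illustrative and are not taken from any company's export format.

```python
def unexplained_match_rate(child_matches, father_matches, mother_matches):
    """Return (percentage, identifiers) for child matches seen in neither parent.

    Each argument is an iterable of match identifiers. Matches in the child's
    list that appear in neither parent's list are treated as likely false
    positives, as described in the text.
    """
    child = set(child_matches)
    if not child:
        raise ValueError("the child has no matches to check")
    unexplained = child - set(father_matches) - set(mother_matches)
    rate = 100.0 * len(unexplained) / len(child)
    return rate, unexplained


# Toy example with made-up identifiers.
rate, unexplained = unexplained_match_rate(
    child_matches=["A", "B", "C", "D", "E"],
    father_matches=["A", "B", "X"],
    mother_matches=["C", "Y"],
)
print(f"{rate:.1f}% of the child's matches appear in neither parent's list")
```

The check only works when both parents have tested. A match missing from both parental lists may also sit just below the company's threshold for a parent, so the result is best read as an estimate.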

False positive matches are more likely to be seen in unphased data. (Phasing is the process of assigning alleles to the mother or the father.) The highest degree of accuracy is achieved by using the phased data from a two-parent/one child trio, where the error rate for phasing is only 0.01%. In the absence of trio data it is possible to phase data by inference using samples from reference populations. This is sometimes known as pseudo-phasing. AncestryDNA is currently the only company to phase all the customer data prior to matching. Ancestry uses a proprietary phasing algorithm known as Underdog. When phased results were compared with a test set of 1,188 trio-phased samples Underdog had a phase error of 0.93% and an impute error of 1.09%.[7] In theory Ancestry should have a lower false matching rate than 23andMe and FTDNA, though there are still likely to be false positives, particularly for short phased segments in the 5-8 cM range. None of the companies currently offer the facility for customers who have tested both parents to phase their own trio data. However, trio phasing can be done by uploading results to the free third-party website GEDmatch.

In general, the larger the shared segments the more likely that the match is genuine. All matches of 15 cMs or more are IBD, and the majority of matches between 10 cM and 15 cM are IBD. As the predicted matching segments get smaller the false positive rate increases. For genuine smaller shared segments in the range of 5 cMs to 10 cMs the common ancestor may be as many as 10 to 15 generations or more back in time. True IBD segments under 5 cMs cannot be reliably detected with the currently available genetic genealogy tests. Durand et al (2014) analysed phased data from 2,952 father-mother-child trios in the 23andMe dataset and identified a false positive rate of over 67% for 2–4 cM segments. They reported that "most 2–3 cM segments are erroneous and only segments longer than 5 cM have a negligible number of false positives".[8] The error rate is likely to be much higher in unphased data. As the false positive rate for segments under 5 cMs is so high these segments should generally not be used in genealogical analyses.

The following table shows the percentage of matching segments (half-identical regions) that are IBD versus those that are IBS (non-IBD). The results are based on an analysis of 9000 haplotypes by John Walden as reported to Tim Janzen.[9][10]

cM    % IBD one side phased    % IBD both sides phased    % lost by second phasing
15    100                      100                        0
13    98                       99                         1
12    98                       97                         1
11    94                       90                         4
10    89                       86                         4
9     85                       80                         6
8     78                       62                         23
7     63                       42                         33
6     42                       26                         38
5     29                       14                         52
4     18                       5                          72
3     9                        1                          89

The following analysis is provided by Tim Janzen. The table shows a comparison of Tim Janzen’s and Rachel Janzen’s matches at GEDmatch at various thresholds using the phased versus unphased Family Finder data.[9]

Data set       20 cM    15 cM    11 cM    10 cM    9 cM     8 cM     7 cM     6 cM     5 cM
TJ unphased    21       54       183      245      421      870      1813     3512     8716
TJ phased      21       54       180      175      263      351      542      1015     1713
RJ unphased    10       27       98       149      349      1042     2331     3673     7460
RJ phased      10       27       95       98       145      197      326      832      1564
% IBD          100.0    100.0    97.9     69.3     53.0     28.7     20.9     25.7     20.3
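
The source does not state how the % IBD row was calculated. The figures are, however, reproduced by pooling both people and dividing the phased match counts by the unphased match counts at each threshold. The Python sketch below works through that arithmetic; the pooling formula is an inference from the numbers, and the variable and function names are illustrative.

```python
# Match counts copied from the table above, ordered by threshold.
THRESHOLDS_CM = [20, 15, 11, 10, 9, 8, 7, 6, 5]
TJ_UNPHASED = [21, 54, 183, 245, 421, 870, 1813, 3512, 8716]
TJ_PHASED = [21, 54, 180, 175, 263, 351, 542, 1015, 1713]
RJ_UNPHASED = [10, 27, 98, 149, 349, 1042, 2331, 3673, 7460]
RJ_PHASED = [10, 27, 95, 98, 145, 197, 326, 832, 1564]


def percent_ibd(phased_counts, unphased_counts):
    """Percentage of unphased matches that survive phasing, pooled over people."""
    phased_total = sum(phased_counts)
    unphased_total = sum(unphased_counts)
    if unphased_total == 0:
        raise ValueError("no unphased matches to compare against")
    return round(100.0 * phased_total / unphased_total, 1)


# One pooled percentage per threshold (assumed formula, see the note above).
PERCENT_IBD = [
    percent_ibd((tj_p, rj_p), (tj_u, rj_u))
    for tj_p, rj_p, tj_u, rj_u in zip(TJ_PHASED, RJ_PHASED, TJ_UNPHASED, RJ_UNPHASED)
]

for threshold, pct in zip(THRESHOLDS_CM, PERCENT_IBD):
    print(f"{threshold} cM: {pct}% IBD")
```

At 10 cM, for example, (175 + 98) / (245 + 149) = 273 / 394, which rounds to 69.3%.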

The techniques of chromosome mapping and triangulation can also be used to help to distinguish between IBD segments (true matches) and non-IBD segments (false matches).

Thresholds for matches

A long consecutive string of half-identical SNP results (typically about 7 cM / 700 SNPs, depending on the test's error rate and other factors) is required before one can infer that two matching DNA segments are probably identical by descent. Thresholds for length and number of mismatches (errors or mutations) are set by each testing company; these criteria must be met before the company will report that two individuals very likely inherited their matching segments from a common ancestor. Each company has made reasonable choices for thresholds. They are aiming for a "middle ground" between missing true matches where people share a common ancestor within the past 300–400 years, and including too many people as matches who don't share a common ancestor within the past 500 or so years.

Both Family Tree DNA and 23andMe use different thresholds for people who have self-identified that they are of Ashkenazi Jewish ancestry. For further details see the article on endogamy.

23andMe

For half-identical regions the thresholds are:

  • Autosomal: 7 cMs and at least 700 SNPs for the first segment; 5 cMs and 700 SNPs for additional segments and for people you are sharing with
  • X (male vs male): 200 SNPs, 1 cM
  • X (male vs female): 600 SNPs, 6 cM
  • X (female vs female): 1200 SNPs, 6 cM

The maximum amount of error tolerated in a half-IBD segment would be roughly 1 opposite homozygote per 300 SNPs; furthermore, each such opposite homozygote in a half-IBD segment must be separated by roughly 300 SNPs (i.e., the length of a sub-segment).

For fully identical regions, the thresholds are:

  • Autosomal and X: 500 SNPs, 5 cM.

The criteria roughly correspond to allowing an error rate of 1%.
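
The length and SNP-count thresholds listed above can be collected into a small lookup table. The following Python sketch is illustrative only and is not 23andMe's implementation: the names are hypothetical, "at least" is assumed for every threshold, and the separate opposite-homozygote error rule is not modelled.

```python
from typing import NamedTuple


class Threshold(NamedTuple):
    min_cm: float
    min_snps: int


# Half-identical region thresholds as listed above (keys are made-up labels).
HALF_IDENTICAL = {
    "autosomal_first": Threshold(7.0, 700),
    "autosomal_additional": Threshold(5.0, 700),
    "x_male_male": Threshold(1.0, 200),
    "x_male_female": Threshold(6.0, 600),
    "x_female_female": Threshold(6.0, 1200),
}

# Fully identical regions, autosomal and X.
FULLY_IDENTICAL = Threshold(5.0, 500)


def meets_threshold(segment_cm: float, snp_count: int, threshold: Threshold) -> bool:
    """True if a segment is long enough and contains enough SNPs.

    Using >= for both comparisons is an assumption.
    """
    return segment_cm >= threshold.min_cm and snp_count >= threshold.min_snps
```

A 6.2 cM autosomal segment with 800 SNPs would fail the first-segment test here but pass the test for an additional segment.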

The Countries of Ancestry feature (formerly known as Ancestry Finder) shows all matches down to 5 cM segments.[11] Only people who have participated in the questionnaire "Where are you from?" will appear in this feature. Close matches are excluded from Countries of Ancestry.

In addition, 23andMe has set a cap of 1000 matches in DNA Relatives and Countries of Ancestry. For most Americans this threshold is excluding many valid matches. You can overcome the threshold to a certain extent by sending an introduction to your matches and/or inviting them to share genomes at the basic level. For people with Colonial American ancestry the threshold is excluding new matches below about 17 cMs. Without this cap Colonial Americans would probably have between 3000 and 5000 matches.[12]

Family Tree DNA

The FTDNA Learning Center has details of the Family Finder autosomal DNA matching thresholds at www.familytreedna.com/learn/autosomal-ancestry/universal-dna-matching/genetic-sharing-considered-match.

The consensus within the genetic genealogy community is that these thresholds are incorrect. Anecdotal evidence suggests that Family Finder does not report matches with a longest shared segment below about 7.69 cM, so this is the de facto autosomal threshold.

Tim Janzen has suggested that the current threshold appears to be 7.69 cMs and at least 500 SNPs for the first segment and a total of at least 20 cMs (including the shorter matching segments between 1 cM and 7 cMs).
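
Tim Janzen's suggested criteria can be expressed as a two-part test. The Python sketch below is a reading of the suggestion above and not a confirmed description of Family Finder's algorithm; the function name is hypothetical, and counting every segment of 1 cM or more towards the total is an assumption.

```python
def passes_suggested_ff_threshold(segments):
    """Apply the suggested Family Finder criteria to one pair of people.

    segments: list of (length_cm, snp_count) tuples for the half-identical
    segments shared by the pair.
    """
    # Part 1: at least one segment of 7.69 cM or more with at least 500 SNPs.
    has_first_segment = any(cm >= 7.69 and snps >= 500 for cm, snps in segments)
    # Part 2: at least 20 cM in total, counting the shorter segments as well
    # (assumption: every segment of 1 cM or more counts).
    total_cm = sum(cm for cm, _ in segments if cm >= 1.0)
    return has_first_segment and total_cm >= 20.0
```

Under this reading a pair sharing a single 8 cM segment would not be reported as a match, because the 20 cM total is not reached.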

Criteria for X-chromosome matches:

  • 1 cM and 500 SNPs for both males and females; matches must already meet the autosomal DNA matching criteria

AncestryDNA

AncestryDNA introduced a new matching system in November 2014.[13] Detailed FAQs and a technical White Paper can be viewed by AncestryDNA testees. AncestryDNA assigns confidence levels depending on the approximate amount of shared centiMorgans. The guidelines in the table below are included in the AncestryDNA white paper. These guidelines are based on phased haplotypes and will not necessarily apply to matches at 23andMe and Family Tree DNA where haplotypes are not phased prior to performing the matching process. Note that some genetic genealogists think that these percentages are far too high. See Tim Janzen's table above for what are considered to be more realistic estimates. Note too that these percentages will be different for endogamous populations.

Confidence score    Approximate amount of sharing    Likelihood you and your match share a recent common ancestor within 5 or 6 generations
Extremely high      More than 30 centiMorgans        Virtually 100%
Very high           20-30 centiMorgans               99%
High                12-20 centiMorgans               95%
Good                6-12 centiMorgans                More than 50%
Moderate            6 centiMorgans or less           20-50%

Note that the AncestryDNA database is 99% American, and it is not yet known if these ranges will apply in the same way to other populations.

Previously AncestryDNA set their threshold for matches at 5 megabases. In around January 2014 they changed to using centiMorgans and set the threshold at 5 cM, but the earlier matches were not rerun. The previous thresholds for other relationships at AncestryDNA are given here.

Ranges of total centiMorgans of IBD segments based on family relationship

The following data has been supplied by Tim Janzen, and is based on 6761 cMs in FTDNA's Family Finder test.[9] The reference data can be found at http://mennodna.remotewebaccess.com/23andme/ref.html.

Relationship                   Range            Expected    Range of number of shared segments
Parent/child                   3539-3748 cMs                23-29
First cousins                  548-1139 cMs     888 cMs     17-32
First cousins once removed     220-638 cMs      444 cMs     12-23
Second cousins                 86-426 cMs       222 cMs     10-18
Second cousins once removed    19-197 cMs       111 cMs     4-12
Third cousins                  16-111 cMs       55.4 cMs    2-6?
Third cousins once removed     0-99 cMs         27.8 cMs    1-4
Fourth cousins                 0-54 cMs         13.8 cMs    0-2
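
Because the ranges overlap, a given total of shared centiMorgans is usually consistent with more than one relationship. The Python sketch below lists every relationship in the table whose range contains a given total. The names are illustrative, the ranges are copied from the table above, and treating both ends of each range as inclusive is an assumption.

```python
# Observed ranges of total shared cM, copied from the table above.
RANGES_CM = {
    "Parent/child": (3539, 3748),
    "First cousins": (548, 1139),
    "First cousins once removed": (220, 638),
    "Second cousins": (86, 426),
    "Second cousins once removed": (19, 197),
    "Third cousins": (16, 111),
    "Third cousins once removed": (0, 99),
    "Fourth cousins": (0, 54),
}


def consistent_relationships(total_cm: float) -> list:
    """Return the relationships whose observed range includes total_cm."""
    return [
        name
        for name, (low, high) in RANGES_CM.items()
        if low <= total_cm <= high
    ]


print(consistent_relationships(300))
```

A total of 300 cMs, for instance, falls within the ranges for both first cousins once removed and second cousins. The table covers only the relationships listed, so an empty result means only that the total lies outside these particular ranges.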

Excess IBD sharing

There are some regions of the genome that are prone to excess IBD sharing. These regions are known colloquially as pile-up regions.

The human leukocyte antigen (HLA) region on chromosome 6 has been found to exhibit a much higher amount of IBD sharing than expected both within and across populations. The genes in this region are responsible for regulation of the immune system in humans, and it is thought that favourable mutations have developed as a result of natural selection.[14] Gusev et al (2011) reported that in Ashkenazi Jews “the entire segment of chromosome 6, between 25 and 35 Mb, is shared among individuals unrecombined at least 4-fold more than any other region in the genome”. They also found that the HLA region had a very high sharing density in some of the other populations studied (Western European, Gujarati Indians, Luhya Kenyans and Yoruba Nigerians).[15] For a good explanation of the underlying reasons for the increased IBD sharing in the HLA region of chromosome 6 see the article by Ann Turner (2010) "Up hill and down dale in the genomic landscape: the odd distribution of matching segments" in the Journal of Genetic Genealogy.[16]

Albrechtsen et al (2010) found a region on chromosome 8 centred on position 9.9 Mb which exhibited an excess of IBD sharing. High IBD sharing signals were also seen on each side of the centromere on chromosome 11. These regions contain clusters of olfactory receptors. Evidence for increased IBD was also seen on chromosomes 2, 3, and 5.[14]

Gusev et al (2011) found recurrently shared regions in chromosomes 2, 4, and 8 in unrelated individuals of different geographical origins, and even across different continents of origin.[15] The region on chromosome 8 which was enriched for IBD sharing was located between 11.1 and 13.3 Mb, a different position to that reported by Albrechtsen.

Li et al (2014) studied genetic relationships from whole genome sequencing data and identified a number of regions with excess IBD in both their pedigree and control dataset using three established methods for IBD detection (GERMLINE, fastIBD, and ISCA). The authors found that "the regions of spurious IBD were largely consistent between the three IBD methods and among European, East Asian, and Mexican American population which is a strong indication that the IBD segments in these regions are artifactual."[17] The excess IBD regions detected in the control populations are shown in the figure below.

Regions where excess IBD is detected by three IBD methods among the control populations. Regions that give rise to excess IBD inferences in GERMLINE (A–C), fastIBD (D–F), and ISCA (G–I) IBD. Black and red shading denotes degree of excess IBD detected. Source: www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1004144


The start and stop positions for the excess IBD regions identified by Li et al are shown in the table below.

The 14 regions of the genome greater than 5 cM with detected pairwise IBD identified in GERMLINE that exceeds the expected pairwise IBD by at least 4-fold between European controls. The genomic regions are given in hg19 coordinates. Source: www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1004144


Carmi et al (2014) analysed the whole genomes of 128 Ashkenazi Jews (AJs) and compared the samples with controls from a Flemish population. They found that there was a recent bottleneck in the Ashkenazi Jewish population of around 350 individuals 600–800 years ago. The authors phased the data for nearly all the chromosomes before performing their analysis. When they used a minimum segment length of 3 cMs they found that all AJs shared on average 1.85% of their genome. When they went up to 5 cMs they found that AJs shared on average 0.84% of their genome. Peaks of IBD sharing over 5 cMs in size were seen on chromosomes 9 and 19 in Ashkenazi Jews. However, the authors were unable to detect any "enrichment of sharing in the HLA region (≈chr 6:25-35 mB) as previously observed in Jewish and non-Jewish populations".[18]

AncestryDNA analysed the matches of over half a million US customers genotyped on an Illumina OmniExpress chip (~700,000 SNPs) and found that in certain regions of the genome hundreds and sometimes thousands of people were predicted to share DNA with each other. The problem was particularly pronounced in customers of Jewish ancestry and in some people of Hispanic descent, but the problem was observed across all ethnic groups. They concluded that these regions of identical DNA were likely to be attributed to a shared ethnic history rather than recent common ancestry.[19][20] Ancestry introduced new phasing and matching algorithms in November 2014 to remove the false matches resulting from these "pile-ups".[21] The technical details can be found in their matching white paper “Discovering IBD Matches across a large, growing database” which is available to AncestryDNA customers. However, the locations of the identified "pile-up" regions have not been disclosed.

See also the post Chromosome pile ups in genetic genealogy: examples from 23andMe and FTDNA on the Genealogy and Genomics blog.

Whole-genome sequencing

The terms IBD and IBS are more relevant to the results of SNP microarray testing than to results of whole-genome sequencing, because microarray testing provides so much less information per centiMorgan of DNA. Microarray test results have an additional complexity since they report on both copies of the chromosome, but the results (with the exception of AncestryDNA) are not phased (that is, it is unknown which nucleotide is on which copy of the chromosome). Thus if one person's SNP result is (CC), this could be at least "half-identical" to either (CC) or (CT) in a second person. A homozygous mismatch such as (CC) vs. (TT) would be required before one could say the results are *not* identical.
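
The comparison of unphased genotypes described above can be sketched in a few lines of Python. This is a simplified illustration and not any company's matching algorithm: the function names are hypothetical, genotypes are assumed to be two-letter strings such as "CC" or "CT", and no-calls and error tolerance are ignored.

```python
def compare_genotypes(a: str, b: str) -> str:
    """Classify one SNP for two unphased genotypes such as "CC" and "CT"."""
    if sorted(a) == sorted(b):
        return "fully identical"  # same two alleles, in either order
    if set(a) & set(b):
        return "half-identical"  # at least one allele in common
    return "mismatch"  # no allele in common, e.g. CC vs TT


def longest_half_identical_run(person_a, person_b) -> int:
    """Length, in SNPs, of the longest run without a mismatch.

    person_a and person_b are equal-length sequences of genotypes listed in
    the same SNP order along one chromosome.
    """
    longest = current = 0
    for genotype_a, genotype_b in zip(person_a, person_b):
        if compare_genotypes(genotype_a, genotype_b) == "mismatch":
            current = 0
        else:
            current += 1
            longest = max(longest, current)
    return longest
```

A run of SNPs is at least half-identical as long as no mismatch interrupts it. Whether such a run is reported as a match depends on the length and SNP-count thresholds described earlier.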

Whole-genome sequencing (WGS) using next generation sequencing (NGS) technology is not currently affordable for the genetic genealogy market, but it is being used in academic studies. In one such study, IBD tools were able to detect all 1st through 6th degree relationships and 55% of 9th through 11th degree relationships, a 5% to 15% increase in relationship detection compared to high-density microarray data.[17]

Whole-genome sequencing can also be used to detect more distant relationships for persons with 8th and 9th degrees of relationship by counting the total number of shared very rare SNPs.[22]

References

  1. Ralph P, Coop G (2013). The Geography of Recent Genetic Ancestry across Europe. PLOS Biology 11(5):e1001555.
  2. Data supplied by Tim Janzen. See Tim Janzen's presentation "Discovering and Verifying your Ancestry using Family Finder" at the 2014 Family Tree DNA Conference on Genetic Genealogy held in Houston, Texas, on 11 October 2014. The slides can be downloaded from http://tinyurl.com/p22ejo4
  3. For a detailed explanation see: Turner A. "Satiable Curiosity: Identity Crisis: Identical by State or Identical by Descent?" Journal of Genetic Genealogy Fall 2011, Volume 7.
  4. For an explanation of compound segments see the FTDNA FAQ "I have tested my husband, our child, and myself. Why does our child have matches that neither parent has?"
  5. Waterbury B. Father/mother/child trio data. Rootsweb autosomal DNA mailing list, 6 October 2014.
  6. Kennett D. Father/mother/child trio data. Rootsweb autosomal DNA mailing list, 7 October 2014.
  7. Ball CA et al (2015). "Matching White Paper: Discovering IBD Matches across a large, growing database." This paper is available to AncestryDNA customers and can be accessed via their personal DNA accounts.
  8. Durand EY, Eriksson N, McLean CY. Reducing pervasive false positive identical-by-descent segments detected by large-scale pedigree analysis. Molecular Biology and Evolution advance access publication online 30 April 2014.
  9. The information in this table was included in Tim Janzen's presentation "Discovering and Verifying your Ancestry using Family Finder" at the 2014 Family Tree DNA Conference on Genetic Genealogy held in Houston, Texas, on 11 October 2014. The slides can be downloaded from http://tinyurl.com/p22ejo4
  10. See also the files on John Walden's website and particularly the chart "atDNA segments that survive after phasing both sides".
  11. 23andMe Customer Care. DNA Relatives: How it Works.
  12. Janzen T. DNA Relatives list. Message posted to the ISOGG Project Administrators mailing list (closed group), 6 January 2015.
  13. Swayne A. DNA matching just got better. Ancestry blog. 19 November 2014.
  14. Albrechtsen A, Moltke I, Nielsen R. Natural selection and the distribution of identity-by-descent in the human genome. Genetics 2010; 186(1):295-308. See Table 2 for the genomic positions of the regions on the different chromosomes where the peaks of IBD sharing were found.
  15. Gusev A, Palamara PF, Aponte G et al. The architecture of long-range haplotypes shared within and across populations. Molecular Biology and Evolution (2011). doi: 10.1093/molbev/msr133. First published online: 6 October 2011. See Table 2 ("Annotated regions enriched for IBD sharing between HapMap populations") for details of the regions of high IBD sharing.
  16. Turner A (2010). Up hill and down dale in the genomic landscape: the odd distribution of matching segments. Journal of Genetic Genealogy 6(1).
  17. Li H, Glusman G, Hu H et al (2014). Relationship estimation from whole-genome sequence data. PLoS Genetics Jan 2014; 10(1): e1004144.
  18. Carmi S, Hui KY, Kochav E et al (2014). Sequencing an Ashkenazi reference panel supports population-targeted personal genomics and illuminates Jewish and European origins. Nature Communications 2014; 5: 4835. See in particular section 4.2 in the supplementary data pdf file where the authors discuss the IBD analysis.
  19. Chahine K. AncestryDNA matching update impacts Jewish ancestry. Ancestry blog, 1 August 2014.
  20. Swayne A. DNA matching just got better. Ancestry blog, 19 November 2014.
  21. Bettinger B. Goodbye false positives! AncestryDNA updates its matching algorithm. The Genetic Genealogist blog, 19 November 2014.
  22. Al-Khudhair A, Qiu S, Wyse M et al. Inference of Distant Genetic Relations in Humans Using “1000 Genomes”. Genome Biology and Evolution (2015) doi: 10.1093/gbe/evv003. First published online: January 7, 2015.