Phasing

From ISOGG Wiki

Phasing is the task or process of determining the parental source of a SNP's alleles (the As, Cs, Ts and Gs). Put more simply, it is the process of determining which DNA came from the mother and which came from the father. The term is usually applied to types of DNA that recombine, such as autosomal DNA or the X-chromosome. The benefit of phasing is being able to identify which ancestor a segment was inherited from.

Who to test?

For the best results, test a family trio (you and both your parents). It is possible to phase about 94% of the SNPs in an autosomal dataset using a two parent/one child trio. The number of SNPs that can be phased increases marginally if siblings are also tested. Roach et al. were able to phase 98.8% of the SNPs by using data from two parents and four children.[1]
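To illustrate why a trio phases most but not all SNPs, here is a minimal Python sketch of phasing a single SNP from a child and both parents. It is not the logic of any particular phasing program; the function name `phase_trio_snp`, the two-letter genotype strings, and the use of "-" for a no-call are assumptions made for the example.

```python
def phase_trio_snp(child, mother, father):
    """Return (maternal_allele, paternal_allele) for one SNP, or None if it
    cannot be phased from the trio alone.

    Genotypes are two-letter strings such as "AG"; letter order is arbitrary.
    """
    # Assumption: "-" marks a no-call; skip any SNP with missing data.
    if "-" in child + mother + father:
        return None
    first, second = child
    # Try both ways of assigning the child's two alleles to the parents.
    options = set()
    for maternal, paternal in ((first, second), (second, first)):
        if maternal in mother and paternal in father:
            options.add((maternal, paternal))
    # Exactly one consistent assignment means the SNP is phased.
    # Zero means the genotypes are inconsistent; two means it is ambiguous
    # (for example, the child and both parents are all heterozygous).
    return options.pop() if len(options) == 1 else None


print(phase_trio_snp("AG", "AA", "AG"))  # ('A', 'G')
print(phase_trio_snp("AG", "AG", "AG"))  # None
```

The second call shows the kind of SNP that stays unphased in a trio: when all three people are heterozygous, either allele could have come from either parent.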

If only one parent is available for testing, first test that parent and all of that parent's children. Then test at least one of the parent's grandchildren through each of the parent's children who had children. It is also reasonable to test the spouses of the parent's children, since that increases the amount of data you can phase.

If no parents are available for testing, first test all of the children in the family, up to at least five (assuming five or more are available for testing). Then test at least one of the parents' grandchildren through each of the parents' children who had children. It is also reasonable to test the spouses of the parents' children, since that increases the amount of data you can phase.

Once you have done the above, start concentrating on testing first and second cousins of the parents. There will be a diminishing return after about five or so first cousins, but it makes sense to test as many first cousins as you can afford to test up to some limit.

Methods

Phasing has not, as yet, been incorporated into the computer programs of the various commercial DNA companies that offer autosomal and X-chromosome tests. Early pioneers of autosomal phasing, like Whit Athey and Tim Janzen, use Microsoft Excel. (NOTE: Do not use versions of Excel prior to 2007, since they are limited to 65,536 rows per worksheet and will not have enough rows. Phasing can also be done with the free and open-source office suite LibreOffice.)

Tim Janzen has written a program in Excel that will phase either 23andMe or Family Finder data from two parents and one of their children. The program can be downloaded from Tim's Dropbox account at:

http://dl.dropbox.com/u/21841126/phasing%20program%20%28small%20version%29.xls.

Instructions on how to use the program may be found at

http://dl.dropbox.com/u/21841126/phasing%20program%20instructions.rtf.

Tim has also uploaded a small version of the program that includes sample data from two parents and one of their children for 500 SNPs which will give people an idea of what the output looks like on a small scale. The program can be downloaded here:

http://dl.dropbox.com/u/21841126/phasing%20program%20%28sample%20data%29.xls

David Pike's program at http://www.math.mun.ca/~dapike/FF23utils/trio-phase.php is another option that does the same thing. David's program does one chromosome at a time, whereas Tim's program can do multiple or all of the autosomal chromosomes at once, assuming that your computer can handle a large Excel file with all of the data in it.

Process

To phase your X-chromosome, all you need is knowledge of your ancestry. For autosomal DNA phasing, you need to have autosomal results for a child and both parents along with matches to any other known relatives.

The following autosomal DNA phasing process was shared by Tim Janzen:

To start with, you need a parent's 23andMe or Family Finder results and the results of one of that parent's children. Then compile a list of all of the parent's known relatives who have been tested and who are first cousins to the parent or more distantly related to the parent.

If you are using 23andMe data for comparison, go to the "Family Inheritance: Advanced" section in Ancestry Labs.

  • Set the parent as the person whose genome is open in 23andMe by clicking on that name in the drop down menu up at the top of the screen next to "Account".
  • Set the child as the 2nd optional family member in the "Family Inheritance: Advanced" section.
  • Select the person to be placed in the "Select a person in question" box. This person needs to be a first cousin or someone else known to be a distant relative of the parent.
  • Click on "Compare".
  • Click on "View in a table". This will generate a list of matching segments for both the parent and for the child.

If you are comparing data in Family Finder the technique is somewhat different:

  • First log into the parent's account and go to the "Chromosome Browser" section of Family Finder.
  • Find the relative in the list of matches.
  • Click on the box next to the relative's name and then click on "Download to Excel".
  • Save the Excel file to the hard drive.
  • Log into the child's account and go to the "Chromosome Browser" section of Family Finder.
  • Find the relative in the list of matches.
  • Click on the box next to the relative's name, click on "Download to Excel", and save the Excel file to the hard drive.
  • Open both Excel files and copy the data from one Excel file so that it is included right below the data in the other Excel file.
  • Then sort that file by the chromosome and then by start location so that all of the matching segments for both the parent and the child are adjacent to each other in the file.
  • Delete all segments that are under 5 cMs for which the parent and the child do not both share a corresponding matching segment with the relative.
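The combine, sort and delete steps above can also be sketched in Python. This is only an illustration under stated assumptions: the field names (`chrom`, `start`, `end`, `cm`), the helper `merge_and_filter`, and the sample values are all made up for the example and do not reflect the actual column layout of a Family Finder download. "Corresponding matching segment" is interpreted here as any overlapping segment on the same chromosome.

```python
def merge_and_filter(parent_rows, child_rows, min_cm=5.0):
    """Combine parent and child segment lists, sort them, and drop small
    segments that the other person does not also share with the relative."""
    rows = [dict(r, who="parent") for r in parent_rows]
    rows += [dict(r, who="child") for r in child_rows]
    # Sort by chromosome, then start location, so that the parent's and
    # child's segments for the same region end up next to each other.
    rows.sort(key=lambda r: (r["chrom"], r["start"]))

    def overlaps(row, others):
        return any(
            o["chrom"] == row["chrom"]
            and o["start"] < row["end"]
            and row["start"] < o["end"]
            for o in others
        )

    kept = []
    for row in rows:
        others = child_rows if row["who"] == "parent" else parent_rows
        # Keep segments of min_cm or more, and smaller ones only when
        # both the parent and the child share that region.
        if row["cm"] >= min_cm or overlaps(row, others):
            kept.append(row)
    return kept


# Made-up example data (positions in base pairs, lengths in cM).
parent_rows = [
    {"chrom": 5, "start": 65_000_000, "end": 110_000_000, "cm": 40.0},
    {"chrom": 7, "start": 10_000_000, "end": 12_000_000, "cm": 3.1},
    {"chrom": 9, "start": 20_000_000, "end": 23_000_000, "cm": 3.5},
]
child_rows = [
    {"chrom": 5, "start": 80_000_000, "end": 104_500_000, "cm": 22.0},
    {"chrom": 9, "start": 20_500_000, "end": 23_000_000, "cm": 3.0},
]

for row in merge_and_filter(parent_rows, child_rows):
    print(row["who"], row["chrom"], row["start"], row["end"], row["cm"])
```

In this made-up data the 3.1 cM segment on chromosome 7 is dropped because only the parent has it, while the small chromosome 9 segments are kept because both the parent and the child share that region.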

It may be reasonable to map some segments in the 3-5 cM range if both the parent and the child share that same segment with the relative, but caution is warranted when mapping segments that don't contain at least 700 SNPs because some matching segments could be IBS (identical by state) and not IBD (identical by descent).

In the Excel file mentioned above, start with the segments on which the child also matches the other relative. For each such segment, fill in column G, beginning at the segment's start point, with the name of the ancestor that the parent must have received the segment from. For example, if a first cousin of the parent is being used for comparison, enter the name of the child's grandparent who was related to that first cousin, the grandparent's year of birth (YOB), and the name of the relative used for comparison. For instance, if the child's start point was at position 80,000,000 on chromosome 5, you would fill cell G177752 with this information. Then copy the contents of cell G177752 and drag them down to fill in all cells in column G down to the end point of that segment. If the end point was at position 104,500,000, fill in the column all the way down to cell G181568. At the same time, fill in cells H177752 to H181568 with the name of the grandparent's spouse and their YOB. I like to add "inf" after the name and the YOB to designate that I have inferred this information from the fact that I have already filled in the corresponding cells in column G. Repeat this same procedure for every segment that the child shares with the relative used for comparison.

Then review the segments that the parent shares with the other relative but the child does not. For instance, say that the parent's matching segment on chromosome 5 starts at 65,000,000 and continues to 110,000,000. Go to cell H174981 and enter the name of the child's grandparent who was related to the first cousin, their YOB, and the name of the relative used for comparison. I also like to add "inf" after this information to indicate that I inferred it from the fact that the child doesn't share that matching segment. Fill the contents of cell H174981 down to cell H177751; this covers the stretch from position 65,000,000 up to the child's start point at 80,000,000. Next, fill in cells H181569 to H182595 with the same information; this covers the stretch between positions 104,500,000 and 110,000,000. At the same time, fill in cells G174981 to G177751 and G181569 to G182595 with the name of the grandparent's spouse, their YOB, and "inf" after that. In similar fashion, continue filling in all the appropriate cells in column H for the segments that the parent shares with the other relative but the child does not, adding the corresponding information for those segments in column G.
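The column G and column H ranges in the chromosome 5 example can be worked out with a small interval calculation. The sketch below is an illustration only, not part of the Excel workflow: the function name `split_parent_segment` is hypothetical, and it works in base-pair positions rather than spreadsheet row numbers, since the row numbers depend on the particular SNP file.

```python
def split_parent_segment(parent_seg, child_segs):
    """Split one parent/relative matching segment into the parts the child
    also shares with the relative and the parts the child does not share.

    parent_seg: (start, end) in base pairs.
    child_segs: list of (start, end) for the same relative and chromosome.
    Returns (shared, parent_only), each a list of (start, end) tuples.
    """
    p_start, p_end = parent_seg
    shared, parent_only = [], []
    cursor = p_start  # everything before cursor has already been assigned
    for c_start, c_end in sorted(child_segs):
        lo = max(c_start, cursor)
        hi = min(c_end, p_end)
        if lo >= hi:
            continue  # no new overlap with the parent's segment
        if lo > cursor:
            parent_only.append((cursor, lo))  # gap the child does not share
        shared.append((lo, hi))
        cursor = hi
    if cursor < p_end:
        parent_only.append((cursor, p_end))
    return shared, parent_only


shared, parent_only = split_parent_segment(
    (65_000_000, 110_000_000), [(80_000_000, 104_500_000)]
)
print(shared)       # [(80000000, 104500000)]
print(parent_only)  # [(65000000, 80000000), (104500000, 110000000)]
```

The `shared` stretch is where the related grandparent goes in column G (with the spouse in column H), and the `parent_only` stretches are where the related grandparent goes in column H (with the spouse in column G).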

Once you have recorded the information for the first relative as described in the paragraph above, continue in similar fashion for all known relatives who are related to the parent no more closely than at the first cousin level. If there are segments that multiple relatives share with the parent, modify the contents of the corresponding cells to include the name or the initials of each relative who shares that segment. This information can be helpful in situations where there is some ambiguity about the start or stop locations for a particular segment. For instance, if you look at my file you will see in cell G182987 the entry "JY1823 Darryl Youngman, FM, and MY". This information continues down to cell G185532. It means that my relatives Darryl Youngman, Frederick Mock, and Mason Youngman all share this segment with both my mom and me, which is why it appears in column G. The ancestor that all five of us share in common is Jacob Youngman (b. ca 1823). The segment of DNA just above this segment in the file was shared by Darryl Youngman, Mason Youngman, my mom and me, but not by Frederick Mock.

Once you have gone through the comparisons for all of the relatives in your list, save the file and keep it on hand for reference. When you have a new match in 23andMe, compare them to the parent and the child in the "Family Inheritance: Advanced" section in "Ancestry Labs", or compare them in Family Finder as I described above. If the parent, the child, and your new match all share a segment that you have previously mapped in the file, then you know that the ancestor listed in column G for that particular segment of DNA must have shared ancestry with your new match. If the parent shares a segment with your new match but the child doesn't share that segment with the match, then you know that the ancestor listed in column H for that particular segment must have shared ancestry with your new match.
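The lookup for a new match can be sketched as a simple overlap search. This is a hedged illustration rather than a description of any existing tool: the function `ancestors_for_new_match`, the dictionary keys, and the placeholder names "Grandparent A" and "Spouse of A" are all hypothetical, and the mapped file is represented as a list of segments instead of spreadsheet rows.

```python
def ancestors_for_new_match(mapped, chrom, start, end, child_also_matches):
    """Return the ancestor labels for mapped segments that overlap the
    new match's segment with the parent.

    mapped: list of dicts with keys chrom, start, end, col_g, col_h.
    child_also_matches: True if the child shares the segment as well.
    """
    # Column G applies when the child shares the segment too;
    # column H applies when only the parent shares it.
    column = "col_g" if child_also_matches else "col_h"
    return [
        seg[column]
        for seg in mapped
        if seg["chrom"] == chrom and seg["start"] < end and start < seg["end"]
    ]


# Placeholder mapping for one stretch of chromosome 5 (names are invented).
mapped = [
    {
        "chrom": 5,
        "start": 80_000_000,
        "end": 104_500_000,
        "col_g": "Grandparent A",
        "col_h": "Spouse of A inf",
    }
]

print(ancestors_for_new_match(mapped, 5, 85_000_000, 95_000_000, True))
print(ancestors_for_new_match(mapped, 5, 85_000_000, 95_000_000, False))
```

An empty list means the new match falls on a stretch that has not been mapped yet.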

It is important to bear in mind that small matching segments can be problematic for mapping purposes, particularly for data in 23andMe. Say, for instance, that a parent shares a matching segment that is 8 cMs in length with a known second cousin but the child doesn't share that segment. You would thus have mapped that entire segment to column H in your file. However, if by chance there had been a crossover during meiosis in the middle of that segment before a portion of it was passed on to the child, then about 4 cMs of the 8 cM segment would be wrongly mapped to column H when it actually should have been placed in column G. For this reason, I initially started by mapping matching segments that were 10 cMs or longer if a relative shared a segment with the parent but not with the child. If there was a crossover in a segment 10 cMs or longer then the child should have received a segment that was at least 5 cMs in length and would thus show up as a matching segment for the child.
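The arithmetic behind the 8 cM and 10 cM examples can be written out as a tiny sketch. It assumes, as the examples above imply, that a piece shorter than 5 cM does not show up as a matching segment for the child; the function name `mismapped_cm` and the `threshold_cm` parameter are made up for the illustration.

```python
def mismapped_cm(child_piece_cm, threshold_cm=5.0):
    """Return how many cM would be wrongly placed in column H when the
    child inherited only child_piece_cm of the parent's matching segment.

    Assumption: a piece shorter than threshold_cm is not reported as a
    match for the child, so the whole parent segment goes to column H.
    """
    if child_piece_cm < threshold_cm:
        return child_piece_cm  # the child's piece is invisible, so it is mis-mapped
    return 0.0  # the child's piece is detected and can be mapped to column G


# Crossover in the middle of an 8 cM segment: the child keeps about 4 cM.
print(mismapped_cm(8 / 2))   # 4.0
# Crossover in the middle of a 10 cM segment: the child keeps about 5 cM.
print(mismapped_cm(10 / 2))  # 0.0
```

A crossover away from the middle gives a different child piece, which can be explored by passing a different `child_piece_cm`.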

One advantage of Family Finder is that it has a 1 cM threshold for matching segments. If a parent and a child both have a matching segment that is in the 2 to 5 cM range and the number of matching SNPs is 500 or more, then there is a reasonably high likelihood that the matching segment is IBD (identical by descent) and not IBS (identical by state). Another advantage of Family Finder is that it doesn't round to the nearest million base pairs like 23andMe does, so there is greater clarity as to the precise boundaries of each matching segment. I have been able to get around the first issue partially and the second issue entirely for 23andMe data by including my family and their relatives in the Mennonite project, where we use a 4 cM threshold for matches and where the matching segments are not rounded to the nearest million base pairs.

Endogamous populations represent another challenging group for mapping purposes since there is a higher likelihood in such populations that matching segments will be IBS (identical by state) and not IBD (identical by descent). In such populations it might be wise to initially start by mapping segments that are at least 10-15 cMs and/or contain greater than 1000 or so SNPs. Phasing data from endogamous populations before doing comparisons is a very good approach and reduces the probability that any particular matching segment will be IBS.

References

  1. Roach JC, Glusman G, Hubley R, et al. Chromosomal haplotypes by genetic phasing of human families. American Journal of Human Genetics. 2011;89(3):382-397.
