Geno 2.0 raw data

From ISOGG Wiki

This article explains how to manipulate the Geno 2.0 raw data files from the Genographic Project.

Download raw data

Extract .csv file

To extract YOURIDNUMBER.zip files:

  • On Windows, you can use the built-in ZIP support or a separate tool such as the free 7-Zip, WinZip or WinRAR.
  • On Linux, ZIP extraction is built into most desktop (GUI) environments.
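
The extraction can also be scripted. The following is a minimal sketch using Python's standard `zipfile` module; the function name `extract_raw_data` and the destination folder are illustrative choices, not part of the Genographic download.

```python
import zipfile
from pathlib import Path


def extract_raw_data(zip_path, dest_dir):
    """Extract every member of the archive and return the paths of the .csv files."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        names = archive.namelist()
    # Only the .csv members are of interest for the steps that follow.
    return sorted(dest / name for name in names if name.lower().endswith(".csv"))
```

For example, `extract_raw_data("YOURIDNUMBER.zip", "geno_raw")` would unpack the archive into a `geno_raw` folder and list the .csv files it contained.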

View and edit .csv file

A .csv file is a plain-text spreadsheet format that can be opened by Excel 2007 and later versions, OpenOffice Calc, LibreOffice Calc, etc. Excel 2003 and earlier versions are limited to 65,536 rows, so they cannot open the larger files. Because .csv is a text format, the files can also be opened in a text editor such as Notepad++.

Until summer 2013 all information was stored in a single file. Since then it has been made available in four files:

  • YOURIDNUMBER.all.csv - All SNPs
  • YOURIDNUMBER.autosomal.csv - Autosomal DNA + X-DNA
  • YOURIDNUMBER.mtdna.csv - mtDNA
  • YOURIDNUMBER.ychromo.csv - Y-DNA

You can identify the location of each SNP from the chromosome identifier, which is in the antepenultimate (third from last) column:

  • Autosomal DNA: chr 1-22 ~126,307 SNPs
  • X-DNA: chr X ~3,803 SNPs
  • Y-DNA: chr Y ~12,064 SNPs
  • mtDNA: chr 0. Differences are reported relative to the Revised Cambridge Reference Sequence. The number of SNPs will vary from one individual to the next.
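
To check these counts against your own file, the rows can be tallied per chromosome. This is a minimal sketch that assumes the semicolon-separated layout with a `Chr` header shown under "Current Data format"; the function name `count_snps_by_chromosome` is illustrative.

```python
import csv
from collections import Counter


def count_snps_by_chromosome(csv_path):
    """Return a Counter mapping each chromosome identifier to its number of rows."""
    with open(csv_path, newline="") as handle:
        # Assumption: the first line is the header SNP;Chr;Allele1;Allele2.
        reader = csv.DictReader(handle, delimiter=";")
        return Counter(row["Chr"] for row in reader)
```

Calling `count_snps_by_chromosome("YOURIDNUMBER.all.csv")` would return one count per identifier found in the file, which can then be compared with the approximate figures above.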

To save a copy containing only the SNP data from a certain region, delete every line except those with the desired chromosome identifier in the second column (or the third column in the first data format). An easy way to do this is to sort the rows in descending order by the chromosome column, so that the lines to keep are grouped together and all the others can be deleted.
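
The same filtering can be done without a spreadsheet. The sketch below assumes the current semicolon-separated format, with a header line and the chromosome identifier in the second column; the function name `save_region` and its parameters are illustrative.

```python
import csv


def save_region(src_path, dest_path, chromosomes):
    """Copy the header and the rows whose chromosome id is in `chromosomes`.

    Returns the number of data rows written.
    """
    wanted = {str(c) for c in chromosomes}
    kept = 0
    with open(src_path, newline="") as src, open(dest_path, "w", newline="") as dest:
        reader = csv.reader(src, delimiter=";")
        writer = csv.writer(dest, delimiter=";")
        writer.writerow(next(reader))  # copy the header line unchanged
        for row in reader:
            # Assumption: the chromosome identifier is in the second column.
            if len(row) > 1 and row[1] in wanted:
                writer.writerow(row)
                kept += 1
    return kept
```

For instance, `save_region("YOURIDNUMBER.all.csv", "x_only.csv", ["X"])` would write only the X-DNA rows, and passing `range(1, 23)` would keep the autosomal rows.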

Current Data format

Used at least since April 2013, and probably since January 2013. The chromosome identifier is in the second column.

SNP;Chr;Allele1;Allele2
CTS100;Y;C;C
CTS10004;Y;G;G
CTS10009;Y;G;G
...
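
A file in this layout can be loaded into a lookup table keyed by SNP name. This is a minimal sketch, assuming every data line has the four fields named in the header; the function name `load_genotypes` is illustrative.

```python
import csv


def load_genotypes(csv_path):
    """Map each SNP name to a (chromosome, genotype) pair, e.g. ("Y", "CC")."""
    genotypes = {}
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter=";"):
            # The genotype is the two allele columns joined together.
            genotypes[row["SNP"]] = (row["Chr"], row["Allele1"] + row["Allele2"])
    return genotypes
```

With the sample rows above, `load_genotypes(path)["CTS100"]` would be `("Y", "CC")`.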

First Data format

Used in December 2012. The chromosome identifier is in the third column.

[Header]
GSGT Version,1.9.4
Processing Date,11/25/2012 0:01 PM
Content,,NGS_iSelect_v1_15030891_2012_B-wRS-2.bpm
Num SNPs,154476
Total SNPs,169786
Num Samples,1850
Total Samples,2796
File ,1680 of 1850
[Data]
GRC12122435_ChipNGv1_37760_F02,101SNP8856FG_A,0,G,G
GRC12122435_ChipNGv1_37760_F02,101SNP8860FA_G,0,C,C
GRC12122435_ChipNGv1_37760_F02,102SNP8856FG_A,0,G,G
...
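
Because this older layout mixes a [Header] block with a [Data] block, it has to be read in two parts. The sketch below is one way to do so. The interpretation of the five data columns as sample id, SNP name, chromosome, allele 1 and allele 2 is an assumption based on the sample above, and the function name `read_first_format` is illustrative.

```python
import csv


def read_first_format(csv_path):
    """Return (header, records) for a comma-separated file with [Header]/[Data] sections."""
    header, records = {}, []
    section = None
    with open(csv_path, newline="") as handle:
        for row in csv.reader(handle):
            if not row:
                continue
            if row[0] in ("[Header]", "[Data]"):
                section = row[0]
            elif section == "[Header]":
                # Keep the first non-empty value; "Content,,..." has an empty field.
                values = [field for field in row[1:] if field]
                header[row[0].strip()] = values[0] if values else ""
            elif section == "[Data]" and len(row) >= 5:
                # Assumed column order: sample id, SNP name, chromosome, allele 1, allele 2.
                records.append(tuple(row[:5]))
    return header, records
```

Each record is a 5-tuple, so the chromosome identifier is `record[2]`, matching the third-column position described above.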

Heterozygous values (Y-SNPs)

Almost all of the SNPs listed in the results file show two identical values (homozygous). A very small number show two different values (heterozygous). A heterozygous call on a Y-SNP is interpreted as "derived", because what is reported as an AB genotype is really a BB genotype. This interpretation has turned out to be quite reliable on many difficult Y-SNPs. [1]
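
To find the affected Y-SNPs in a file, the rows whose two alleles differ can be listed. This is a minimal sketch, assuming the current semicolon-separated format with `SNP`, `Chr`, `Allele1` and `Allele2` headers; the function name `heterozygous_y_calls` is illustrative, and the sketch does not cover how missing calls, if any, are encoded.

```python
import csv


def heterozygous_y_calls(csv_path):
    """Return (SNP, allele1, allele2) for every Y row whose two alleles differ."""
    calls = []
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter=";"):
            if row["Chr"] == "Y" and row["Allele1"] != row["Allele2"]:
                calls.append((row["SNP"], row["Allele1"], row["Allele2"]))
    return calls
```

The SNPs returned are the candidates to treat as derived under the interpretation described above.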

References

  1. Thomas Krahn, Nov. 2012, http://tech.groups.yahoo.com/group/R1b-L21-Project/message/12752