Channel: XHMM — GATK-Forum

EXOME.interval_list


Hi Menachem,

I followed the XHMM workflow at "https://atgu.mgh.harvard.edu/xhmm/tutorial.shtml" and I am planning to use XHMM for an exome project with 60 samples. I just want to make sure that my EXOME.interval_list is fine.
I like Picard-style interval list files because they make it easy to use IntervalListTools.jar to sort by coordinate, merge, and so on.
I built my Picard-style interval list as follows. First, I took the header from my BAM file:
samtools view -H <myfile.bam> > headerForIntervals.txt

I got my exon list from UCSC as follows:
UCSC > Table Browser > assembly = hg19 > group = Genes and Gene Predictions > track = Ensembl Genes > region = genome > output format = BED > get output > create one BED record per: Exons plus 0 > Exons.bed

And lines looked like:

chr1    66999065    66999090    ENST00000237247_exon_0_0_chr1_66999066_f    0   +
chr1    66999928    67000051    ENST00000237247_exon_1_0_chr1_66999929_f    0   +
chr1    67091529    67091593    ENST00000237247_exon_2_0_chr1_67091530_f    0   +
chr1    67098752    67098777    ENST00000237247_exon_3_0_chr1_67098753_f    0   +
...

Then I did:
cat Exons.bed | sed 's/chr//g' | awk 'BEGIN{FS=OFS="\t"} {print $1,$2,$3,$6,$4}' > IntervalBody.txt && cat headerForIntervals.txt IntervalBody.txt > EXOME.interval_list
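
One detail worth checking in this conversion: BED start coordinates are 0-based, while Picard interval lists use 1-based, inclusive coordinates. A minimal sketch of a variant that shifts the start by one is shown here. It assumes the contig names in the BAM header have no "chr" prefix (as in the header shown in this post), and the ".sample" file names are only illustrative.

```shell
# Tiny sample in the same layout as Exons.bed (file names are illustrative).
printf 'chr1\t66999065\t66999090\tENST00000237247_exon_0_0_chr1_66999066_f\t0\t+\n' >  Exons.sample.bed
printf 'chr1\t66999928\t67000051\tENST00000237247_exon_1_0_chr1_66999929_f\t0\t+\n' >> Exons.sample.bed

# Strip the leading "chr" from the contig column only, shift the 0-based BED
# start to a 1-based start, and emit: contig, start, end, strand, name.
awk 'BEGIN { FS = OFS = "\t" }
     { sub(/^chr/, "", $1); print $1, $2 + 1, $3, $6, $4 }' Exons.sample.bed > IntervalBody.sample.txt

cat IntervalBody.sample.txt
```

With this variant the start column matches the 1-based position embedded in the UCSC exon name (66999066 for the first sample line).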

The final file, after sorting with IntervalListTools.jar, looked like:

@HD VN:1.0  GO:none SO:coordinate
@SQ SN:1    LN:249250621
@SQ SN:2    LN:243199373
...
@PG ID:bwa  PN:bwa  VN:0.6.1-r104
@PG ID:GATK IndelRealigner  VN:1.6-13-g91f02df  CL:knownAlleles=[] targetIntervals=/export/working/ortak/131026_SN1030_0182_AC2G4JACXX_fq/Project_Run_25102013/Sample_ALS_173/ALS_173.intervals LODThresholdForCleaning=5.0 consensusDeterminationModel=USE_READS entropyThreshold=0.15 maxReadsInMemory=150000 maxIsizeForMovement=3000 maxPositionalMoveAllowed=200 maxConsensuses=30 maxReadsForConsensuses=120 maxReadsForRealignment=20000 noOriginalAlignmentTags=false nWayOut=null generate_nWayOut_md5s=false check_early=false noPGTag=false keepPGTags=false indelsFileForDebugging=null statisticsFileForDebugging=null SNPsFileForDebugging=null
1   11868   12227   +   ENST00000456328_exon_0_0_1_11869_f
1   12612   12721   +   ENST00000456328_exon_1_0_1_12613_f
1   13220   14409   +   ENST00000456328_exon_2_0_1_13221_f
1   11871   12227   +   ENST00000515242_exon_0_0_1_11872_f
1   12612   12721   +   ENST00000515242_exon_1_0_1_12613_f
...
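
The listing above contains overlapping and duplicated exons that come from different transcripts of the same gene. If non-overlapping targets are wanted, one possible approach is to sort the body lines and merge overlaps before adding the header. The sketch below is only an illustration: it ignores strand when merging, keeps the strand and name of the first interval in each merged group, and uses made-up ".sample" file names. IntervalListTools also documents a UNIQUE option for merging overlapping intervals, which may make a step like this unnecessary.

```shell
# Sample body lines copied from the listing above (file names are illustrative).
printf '1\t11868\t12227\t+\tENST00000456328_exon_0_0_1_11869_f\n'  > body.sample.txt
printf '1\t12612\t12721\t+\tENST00000456328_exon_1_0_1_12613_f\n' >> body.sample.txt
printf '1\t13220\t14409\t+\tENST00000456328_exon_2_0_1_13221_f\n' >> body.sample.txt
printf '1\t11871\t12227\t+\tENST00000515242_exon_0_0_1_11872_f\n' >> body.sample.txt
printf '1\t12612\t12721\t+\tENST00000515242_exon_1_0_1_12613_f\n' >> body.sample.txt

# Sort by contig, then numerically by start and end. Merge any interval that
# starts at or before the end of the previous interval on the same contig.
# Assumption: strand is ignored; the first interval's strand and name are kept.
sort -t "$(printf '\t')" -k1,1 -k2,2n -k3,3n body.sample.txt |
awk 'BEGIN { FS = OFS = "\t" }
     NR > 1 && $1 == chrom && $2 + 0 <= cend { if ($3 + 0 > cend) cend = $3 + 0; next }
     NR > 1 { print chrom, cstart, cend, strand, name }
     { chrom = $1; cstart = $2 + 0; cend = $3 + 0; strand = $4; name = $5 }
     END { if (NR > 0) print chrom, cstart, cend, strand, name }' > body.merged.sample.txt

cat body.merged.sample.txt
```

On the five sample lines this leaves three intervals, one per distinct exon region.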

Is everything fine? Was this the right way to get exons from UCSC? Is there a better way to get an exon list than the Table Browser? Should I instead build my list from the Ensembl GTF file by running something like

awk 'BEGIN{FS=OFS="\t"} $3=="exon" && $2=="protein_coding"' <myfile.gtf>

for GTF lines like this one:

1 processed_transcript exon 13221 14409 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "3"; gene_name "DDX11L1"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; exon_id "ENSE00002312635";
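
For the GTF route, a rough sketch of turning exon lines into interval-list body lines could look like the following. GTF coordinates are already 1-based and inclusive, so no shift is applied. The sketch assumes the biotype is in the second column, as in the line above; in more recent Ensembl GTF releases the second column appears to hold the annotation source instead, with the biotype moved into the attributes. The file names, the second sample line and its IDs (GENE_X, TX_1) are made up for illustration.

```shell
# Two sample GTF lines with tab-separated columns. The first is the line quoted
# above, trimmed to a few attributes; the second is a made-up protein_coding
# exon with placeholder IDs.
printf '1\tprocessed_transcript\texon\t13221\t14409\t.\t+\t.\tgene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "3";\n' > sample.gtf
printf '1\tprotein_coding\texon\t1000\t1200\t.\t-\t.\tgene_id "GENE_X"; transcript_id "TX_1"; exon_number "1";\n' >> sample.gtf

# Keep protein_coding exons and emit: contig, start, end, strand, name.
awk 'BEGIN { FS = OFS = "\t" }
     $3 == "exon" && $2 == "protein_coding" {
         # Pull transcript_id and exon_number out of the attribute column.
         tx = ex = "NA"
         if (match($9, /transcript_id "[^"]+"/)) tx = substr($9, RSTART + 15, RLENGTH - 16)
         if (match($9, /exon_number "[^"]+"/))   ex = substr($9, RSTART + 13, RLENGTH - 14)
         print $1, $4, $5, $7, tx "_exon_" ex
     }' sample.gtf > GtfBody.sample.txt

cat GtfBody.sample.txt
```

Only the second sample line passes the filter, because the quoted line has processed_transcript in its second column.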

Thank you so much!

