Hi Menachem,
I followed the XHMM tutorial at "https://atgu.mgh.harvard.edu/xhmm/tutorial.shtml" and I am planning to use XHMM for an exome project with 60 samples. I just want to make sure that my EXOME.interval_list is fine.
I like Picard-style interval list files because IntervalListTools.jar makes it so handy to sort them by coordinate, merge them, and so on.
I made my Picard-style interval list as follows. First I took the header from one of my BAM files:
samtools view -H <myfile.bam> > headerForIntervals.txt
I got my exon list from UCSC as follows:
UCSC > Table Browser > hg19 > group = genes and predictions > track = Ensembl Genes > region = genome > output = bed > get output > bed records = Exons plus 0 > Exons.bed
The lines looked like this:
chr1 66999065 66999090 ENST00000237247_exon_0_0_chr1_66999066_f 0 +
chr1 66999928 67000051 ENST00000237247_exon_1_0_chr1_66999929_f 0 +
chr1 67091529 67091593 ENST00000237247_exon_2_0_chr1_67091530_f 0 +
chr1 67098752 67098777 ENST00000237247_exon_3_0_chr1_67098753_f 0 +
...
Then I did:
sed 's/chr//g' Exons.bed | awk 'BEGIN{FS=OFS="\t"} {print $1,$2,$3,$6,$4}' > IntervalBody.txt && cat headerForIntervals.txt IntervalBody.txt > EXOME.interval_list
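One thing I am not sure about in the command above is the coordinate system. BED starts are 0-based, while Picard-style interval lists use 1-based inclusive coordinates, so the start column probably needs +1. The exon names seem to agree: the BED start 66999065 sits next to a name ending in 66999066. Here is a minimal Python sketch of the conversion as I understand it. The function name bed_to_interval_line is just something I made up for illustration. Unlike my sed command, it strips "chr" only from the contig column and leaves the exon name untouched.

```python
def bed_to_interval_line(bed_line: str) -> str:
    """Convert one 6-column BED record into a Picard-style interval line.

    Assumes tab-separated BED with columns:
    chrom, start (0-based), end, name, score, strand.
    """
    chrom, start, end, name, _score, strand = bed_line.rstrip("\n").split("\t")[:6]

    # Strip only a leading "chr" from the contig so it matches the BAM header (1, 2, ...).
    if chrom.startswith("chr"):
        chrom = chrom[3:]

    # BED start is 0-based; the interval list start is 1-based. The end stays as it is.
    one_based_start = int(start) + 1

    return "\t".join([chrom, str(one_based_start), end, strand, name])


if __name__ == "__main__":
    example = "chr1\t66999065\t66999090\tENST00000237247_exon_0_0_chr1_66999066_f\t0\t+"
    print(bed_to_interval_line(example))
```

If this is right, the first exon above would become 1 66999066 66999090 rather than 1 66999065 66999090.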
The final file, after sorting with IntervalListTools.jar, looked like this:
@HD VN:1.0 GO:none SO:coordinate
@SQ SN:1 LN:249250621
@SQ SN:2 LN:243199373
...
@PG ID:bwa PN:bwa VN:0.6.1-r104
@PG ID:GATK IndelRealigner VN:1.6-13-g91f02df CL:knownAlleles=[] targetIntervals=/export/working/ortak/131026_SN1030_0182_AC2G4JACXX_fq/Project_Run_25102013/Sample_ALS_173/ALS_173.intervals LODThresholdForCleaning=5.0 consensusDeterminationModel=USE_READS entropyThreshold=0.15 maxReadsInMemory=150000 maxIsizeForMovement=3000 maxPositionalMoveAllowed=200 maxConsensuses=30 maxReadsForConsensuses=120 maxReadsForRealignment=20000 noOriginalAlignmentTags=false nWayOut=null generate_nWayOut_md5s=false check_early=false noPGTag=false keepPGTags=false indelsFileForDebugging=null statisticsFileForDebugging=null SNPsFileForDebugging=null
1 11868 12227 + ENST00000456328_exon_0_0_1_11869_f
1 12612 12721 + ENST00000456328_exon_1_0_1_12613_f
1 13220 14409 + ENST00000456328_exon_2_0_1_13221_f
1 11871 12227 + ENST00000515242_exon_0_0_1_11872_f
1 12612 12721 + ENST00000515242_exon_1_0_1_12613_f
...
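I also notice that exons from different transcripts overlap in this list (for example 11868-12227 and 11871-12227), and some are exact duplicates. I do not know whether XHMM needs the targets merged first, but in case it does, here is a rough Python sketch of what I understand "merging" to mean. It is only an illustration, not what IntervalListTools.jar actually does internally. The names Interval and merge_intervals are my own, and contigs are sorted as plain strings rather than in header order.

```python
from typing import List, Tuple

# (contig, start, end) with inclusive coordinates; a hypothetical minimal record.
Interval = Tuple[str, int, int]


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping intervals on the same contig (abutting ones are kept apart)."""
    merged: List[Interval] = []
    # Sort by contig (string order), then start, then end.
    for contig, start, end in sorted(intervals):
        if merged and merged[-1][0] == contig and start <= merged[-1][2]:
            # Overlaps the previous interval: extend its end if needed.
            prev_contig, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_contig, prev_start, max(prev_end, end))
        else:
            merged.append((contig, start, end))
    return merged


if __name__ == "__main__":
    exons = [
        ("1", 11868, 12227),
        ("1", 12612, 12721),
        ("1", 13220, 14409),
        ("1", 11871, 12227),
        ("1", 12612, 12721),
    ]
    for interval in merge_intervals(exons):
        print(*interval, sep="\t")
```

On the five lines shown above, this would leave three targets: 11868-12227, 12612-12721 and 13220-14409.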
Is everything fine? Was this the right way to get exons from UCSC? Is there a better way to get an exon list than the Table Browser? Should I instead build my list from the Ensembl GTF file by running something like
awk 'BEGIN{FS=OFS="\t"} $3=="exon" && $2=="protein_coding"'
for GTF lines like this one:
1 processed_transcript exon 13221 14409 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "3"; gene_name "DDX11L1"; gene_biotype "pseudogene"; transcript_name "DDX11L1-002"; exon_id "ENSE00002312635";
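To make the filter I have in mind explicit, here is the same idea as a small Python sketch. It assumes the GTF is tab-separated, that column 3 is the feature type, and that column 2 carries the biotype, as in the line above (I am not sure this holds for every Ensembl release). The function name is_protein_coding_exon is hypothetical.

```python
def is_protein_coding_exon(gtf_line: str) -> bool:
    """Return True for GTF exon records whose source column is 'protein_coding'."""
    # Skip header/comment lines.
    if gtf_line.startswith("#"):
        return False

    fields = gtf_line.rstrip("\n").split("\t")
    # A GTF record has 9 tab-separated columns; ignore anything shorter.
    if len(fields) < 9:
        return False

    source, feature = fields[1], fields[2]
    return feature == "exon" and source == "protein_coding"


if __name__ == "__main__":
    line = (
        "1\tprocessed_transcript\texon\t13221\t14409\t.\t+\t.\t"
        'gene_id "ENSG00000223972"; transcript_id "ENST00000456328";'
    )
    # The example record is an exon, but its source is not protein_coding.
    print(is_protein_coding_exon(line))
```

With this filter the DDX11L1 line above would be dropped, since its second column is processed_transcript. Also, the GTF start 13221 is already 1-based, whereas my interval list has 13220 for the same exon.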
Thank you so much!