ChIPseq motif

From XenopusBioinfo
Jump to: navigation, search

You took fastq reads from an H3K4me3 experiment, made a genomic bowtie2 index from a fasta file, aligned the reads, converted the aligned sam files to bam files, sorted them, indexed them, and put them on IGV. You also called H3K4me3 peaks over background with findPeaks, converted them into bed files with pos2bed.pl. Nice! You guys did such a great job yesterday that today I am going to throw some typos into the commands. Just like real life! When they don’t work, you will have to fuss with them and figure out what went wrong. It may be that the target directory isn’t specified, or that I used the wrong option. Fortunately, if you run each command by itself, you will get that short instructional thing about how to use it. Should it not work, try running the command alone and looking at the syntax. You’ll figure it out.

First, let’s normalize the bam file that you’re looking at on the browser to a standard library size. We’re doing this because if you run different experiments and sequence to different read depths, the pileup of reads could be dramatically higher when they are not actually enriched any differently. There is a standardized file format for genome browsers called “bedGraph”. HOMER has a nice tool to do just that. Log into the server first:

$ ssh yourname@10.120.5.246

Then make your way to your directory with the chipseq alignments.

$ cd ~/chipseq

Now use this simple command to make the bedGraph file:

$ makeUCSCfile Baby_H3K4me3_tags/ -o auto

Log out of the server, and copy it back to your own computer:

$ scp yourname@10.120.5.246:~/chipseq/Baby_H3K4me3_tags/Baby_H3K4me3_tags.ucsc.bedGraph.gz .

If you shut down IGV last night, fire it up again, and load the baby genome, the gtf file, the baby sorted bam files, and now also load the new bedGraph file you just made. Now, when you do other chipseqs, you can do this with all of them and be able to compare them directly. You’re probably noticing that the peaks aren’t as discrete - I think this has to do with how the script using a sliding-window approach to add up the reads - so maybe play with all the makeUCSCfile settings to see how the outputs look different.

You probably noticed that most, but not all, H3K4me3 peaks in the browser lay on top of the transcriptional start sites. Suppose you only want the ones that are definitively promoters, based on the gene models. Even on that scaffold there are a lot and you wouldn’t want to do that by hand, and forget about the rest of the genome, where there are about 22,000.

Fortunately, there is an easy script in HOMER to do this. Go to the directory with the H3K4me3 peakfile (after you called the peaks in HOMER it is probably called “regions.txt”).

The tool we are going to use is called getDistalPeaks. Run it first to have a look at the options:

$ getDistalPeaks.pl

You’ll note among other options is one called -prox. getDistalPeaks was originally designed to get peaks far away from a feature, but the developer put in this option so you can get close ones. You’ll note that there are other options to get sequence in and around various features (gene bodies, etc.)

HOMER was designed for relatively well-studied genomes. Mouse, human, etc. Frog is not one of those (nor axlotl). Fortunately, I was able to bully the developer into including some custom genome functionality. This requires the fasta of the genome sequence and the annotation file (.gtf). To make things exceptionally smooth, I generated a .gtf file that is solely for the two scaffolds in your genome reference/index. It is in the common “reads” directory. Please find out how long it is. The path/name of the file is

/opt/xenopus/reads/Baby_chipseq.gtf

So now to get the subset of the peaks that are around Taejoon’s promoters, you will want to tell getDistalPeaks the appropriate parameters. I’ll assume you have all been paying attention and know where your peakfile from yesterday is and that you haven’t deleted anything in the past day, so go to that directory and run the following command:

$ getDistalPeaks.pl regions.txt ~/chipseq/Baby_chipseq_genome.fa -gtf /opt/xenopus/reads/Baby_chipseq.gtf -d 1000 -prox > H3K4me3_promoters.only.txt

It should keep some, but not all, of the H3K4me3 peaks.

Now that you’ve isolated only the promoter peaks, turn the peakfile into a bed file.

$ pos2bed.pl H3K4me3_promoters_only.txt -bed

And then download the bed file from the server onto your own computer. Next, load it into IGV along with the other stuff from yesterday: the original Baby_chipseq_genome.fa, the sorted bam files, the other bed files. Since it took me awhile to be smart enough to make a small .gtf file, you may want to replace the one from yesterday (Mayball_ensembl_v7b.gtf) with the new one on the server instead. It’s smaller, and if your laptop has been having memory issues this might help.

$ scp yourname@10.120.5.246:/opt/xenopus/reads/Baby_chipseq.gtf

and then load that file into IGV along with the others and skip the larger gtf file. Cruise around; you should find that this bed file only contains promoters actually straddling transcriptional start sites.

Now let’s do the reverse and only get H3K4me3 peaks that are far away from transcriptional start sites:

$ getDistalPeaks.pl regions.txt ~/chipseq/Baby_chipseq_genome.fa -gtf /opt/xenopus/reads/Baby_chipseq.gtf -d 1000  > H3K4me3_distal_only.txt

Convert that peakfile to a bedfile that IGV can interpret:

$ pos2bed.pl H3K4me3_distal_only.txt -bed

Get that from the server onto your own machine:

$ scp yourname@10.120.5.246:/opt/xenopus/reads/H3K4me3_distal_only.bed .

Load that into IGV, too. See how the features combined add up to the original peakfile.

Now you have 3 peakfiles: you have one that’s got all the peaks, one that has peaks close to transcriptional start sites (promoters) and another has peaks far away from them (probably enhancers). Unfortunately, you still have to scroll through the damn browser to get a list of genes that the peaks are in! HOMER comes with a nifty tool called annotatePeaks that you can use to figure out what sequence features (in this case, protein-coding genes) are nearest to the peaks, how far away they might be, etc.

$ annotatePeaks.pl regions.txt ~/chipseq/Baby_chipseq_genome.fa -gtf /opt/xenopus/reads/Baby_chipseq.gtf

It should spit out a bunch of stuff. Now use “>” to put it into a file:

$ annotatePeaks.pl regions.txt ~/chipseq/Baby_chipseq_genome.fa -gtf /opt/xenopus/reads/Baby_chipseq.gtf > regions_annotated.txt

Copy this file from the server back to your computer, fire up excel and open the regions.annotated.txt file and look at what sort of annotation information you get.

Maybe you want to see what the general shape of the tag density and nucleotide frequencies look like across all promoters. Unfortunately, that small number of peaks (~200) is not gonna look like much. We’ve been using the “Baby” datasets so that the exercises wouldn’t be too computationally-intensive (for the mapping, etc). Log back into the server and copy the following peakfile to your ~/chipseq directory:

$ cp /opt/xenopus/reads/chipseq/H3K4me3_all_regions.txt .

Now let’s perform a similar operation on all these peaks by getting only the ones that are on promoters. Note that we are now going to tell getDistalPeaks to look for the full genome and gtf file as compared to the Baby version:

$ getDistalPeaks.pl H3K4me3_all_regions.txt  /opt/xenopus/fastas/full_sequences/LAEVIS_genome_7.1.repeatMasked.fa -gtf /opt/xenopus/reads/Mayball_ensembl_v7b.gtf -d 1000 -prox > all_H3K4me3_promoters.txt

While we’re at it, let’s get all the peaks that aren’t near promoters, too:

$ getDistalPeaks.pl H3K4me3_all_regions.txt  /opt/xenopus/fastas/full_sequences/LAEVIS_genome_7.1.repeatMasked.fa -gtf /opt/xenopus/reads/Mayball_ensembl_v7b.gtf -d 1000 > all_H3K4me3_enhancers.txt

I bet the peaks around promoters are pretty good, and I bet the other ones won’t be as clean. To figure this out, let’s look at a histogram of the reads. HOMER has a function to do this, it’s called -hist. You need to tell it what bin size to count the reads in (10 should be fine, option is -hist 10). You also have to tell it what tag directory to use to count, so you’ll need to use the H3K4me3 tag directory, too (option is -d /path/to/tag/dir). Just for kicks, throw in the nucleotide frequencies (option is -nuc). We also want to have a large enough window around the TSS to figure out what’s going on, so we’ll use +/-2500 bases. Command:

$ annotatePeaks.pl all_H3K4me3_promoters.txt /opt/xenopus/fastas/full_sequences/LAEVIS_7.1_1kb.fa -gtf /opt/xenopus/reads/Mayball_ensembl_v7b.gtf -hist 10 -d /opt/xenopus/reads/chipseq/pooled_H3K4me3.sorted.tags -nuc -size 5000  > H3K4me3_promoters_hist.txt

And let’s go ahead and do the same on the H3K4me3 peaks that aren’t found in promoters:

$ annotatePeaks.pl all_H3K4me3_enhancers.txt /opt/xenopus/fastas/full_sequences/LAEVIS_7.1_1kb.fa -gtf /opt/xenopus/reads/Mayball_ensembl_v7b.gtf -hist 10 -d /opt/xenopus/reads/chipseq/pooled_H3K4me3.sorted.tags -nuc -size 5000 > H3K4me3_enhancers_hist.txt

Now let’s log out of the server and copy those files to the laptop in front of you:

$ scp yourname@10.120.5.246:~/chipseq/*hist.txt .

Finally, open those files with excel and start making scatterplots. First use the first two columns, which will be the position and then the tag density: make a scatterplot.

Next, use the first column and last four columns, which will be the nucleotide frequencies.

Finally, copy one of the histogram plots and paste it on top of the other. Then you’ll see why everybody says H3K4me3 mostly hangs around promoters and not elsewhere.