Susi


**Things to do:**

Aim 0) Quality Control Analysis for small and long RNA seq: DONE

Aim 1) mmu-mir-GUIDE: DONE

Aim 2) Loop gives the expression values of the Pre-MiRNA: DONE

Aim 3) Ratio of GUIDE / Loop tell you that DICER is working: DONE

Aim 4) Long RNA-Seq analysis. HEAT map of all differentially expressed: DONE
 * Do the differential expression analysis for both conditions on mRNA (long RNA): DONE

Aim 5) ITRAQ protein data analysis

Aim 6) link protein expression levels from iTRAQ data to miRNA data (overlap between iTRAQ and predicted miRNA targets)

Aim 7) link mRNA expression levels from long RNA seq data to miRNA data (overlap between iTRAQ and predicted miRNA targets)

The data is located on the shared storage, which you can access from any bluehelix node. The path is: /data/gingeras/lab/datastore/Sequencing/Mouse/Short_RNA_Seq/trin/ File permissions are read-only. There are 4 directories:
 * 1224-12_UN
 * 1224-5_UN
 * 1224-8_UN
 * 1309-10_UN
Each directory contains many files, but as Carrie stated, the ones you're probably interested in will be:
 * "xxx.txt.gz" - the fastq file
 * "xxx.fpkm.gz" - FPKMs
 * "xxx.bam" - BAM file
 * "xxx.bigWig" - bigWigs

Long RNA seq data:
 * /data/gingeras/lab/datastore/Sequencing/Mouse/Long_RNA_Seq/trin/1309-11/
 * /data/gingeras/lab/datastore/Sequencing/Mouse/Long_RNA_Seq/trin/1309-15/
 * /data/gingeras/lab/datastore/Sequencing/Mouse/Long_RNA_Seq/trin/1309-10/
 * /data/gingeras/lab/datastore/Sequencing/Mouse/Long_RNA_Seq/trin/1224-12/
 * /data/gingeras/lab/datastore/Sequencing/Mouse/Long_RNA_Seq/trin/1224-5/
 * /data/gingeras/lab/datastore/Sequencing/Mouse/Long_RNA_Seq/trin/1224-8/

Short RNA seq:
 * /data/gingeras/lab/datastore/Sequencing/Mouse/Short_RNA_Seq/trin/1309-11_UN/
 * /data/gingeras/lab/datastore/Sequencing/Mouse/Short_RNA_Seq/trin/1309-15_UN/
 * Etc.

L-RNA-SEQ: gene expression, heat map. Microarray didn't find anything. L-RNA and S-RNA: S-RNA < 200, L-RNA > 200. Short: miRNA, snoRNA, etc.

L-RNA: mRNA

Read Length? Single End,

Aim 1) mmu-mir-GUIDE: the mature microRNA. We want a heatmap for all of them across the 2 conditions.
 * The first 3 columns are 1224: these do NOT express mu-p53, so they are p53-null.
 * 1309 is a control hairpin that doesn't target anything, so these cells express everything, including mutant p53, which is what we are interested in.
 * We need to take just the GUIDE miRNAs and compare the two conditions to see which ones are most expressed, and with what significance. A heatmap will show the difference.
 * Mu-p53 down-regulates DICER, so with less DICER in the mu-p53 cells there is less processing into mature miRNA.
 * Hypothesis: we should see MANY miRNAs down-regulated in the mutant, and the heatmap will show that nicely. Dicer is not completely gone, so some miRNAs will be only somewhat down and some will still be there.

Try to find a program that takes a list of miRNAs and returns the pathways or target mRNAs that they target. miRBase gives you the sequence for each microRNA. Completed diff expression. **Need to generate the heatmap intensity scale. Time needed: 1 hour. Estimated completion date: Feb 2nd**
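One way the heatmap intensity scale could be generated is sketched below: drop low-count miRNAs, log-transform, then z-score each row so every miRNA is shown on the same scale. This is only a sketch under assumptions. The function name `heatmap_matrix`, the `min_reads` cutoff, the log2(count + 1) transform, and the placeholder miRNA names and counts are all hypothetical, not taken from the real data.

```python
import math
from statistics import mean, pstdev


def heatmap_matrix(counts, min_reads=5):
    """Turn {mirna: [reads per lane]} into {mirna: [row z-scores]}.

    Assumption: a miRNA is dropped if ANY lane has fewer than min_reads reads.
    """
    scaled = {}
    for name, lanes in counts.items():
        if min(lanes) < min_reads:
            continue  # too few reads to trust this row
        logged = [math.log2(c + 1) for c in lanes]  # +1 avoids log2(0)
        mu, sd = mean(logged), pstdev(logged)
        if sd == 0:
            scaled[name] = [0.0] * len(lanes)  # flat row: no variation to show
        else:
            scaled[name] = [(x - mu) / sd for x in logged]
    return scaled


# Hypothetical counts: first three lanes are 1224, last three are 1309.
example = {
    "mmu-miR-A": [120, 150, 130, 40, 35, 50],
    "mmu-miR-B": [3, 0, 2, 1, 4, 2],
}
z = heatmap_matrix(example)
for name, row in z.items():
    print(name, [round(v, 2) for v in row])
```

The z-scored rows can then be handed to whatever plotting or clustering tool is used for the heatmap; positive values are lanes above that miRNA's own mean, negative values are lanes below it.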

Aim 2) Loop gives the expression values of the Pre-MiRNA. Do the same above analysis with Loop now.
**Need to generate the heatmap intensity scale. Time needed: 1 hour. Estimated completion date: Feb 2nd**

Aim 3) The ratio of GUIDE / Loop tells you whether DICER is working. A low ratio means Dicer is not working, because loop accumulates: no processing of Loop --> Guide. **(Required effort: 12 hours. Estimated completion date: Feb 6th 2012)**
 * Same as above, for the ratio. Remove read lanes with less than 10 total.
 * For guide, choose between the 3p and 5p rows: take the one row that has the maximum reads mapped.
 * Take the ratio guide/loop, then find the difference in the ratios between the two conditions. If the ratio is close to 1, the transcript is being processed; if the ratio is <<1, there is no processing, because there is more loop and less guide.
 * Heatmap to cluster miRNAs on ratios.
 * T-test or similar test to see how each condition differs from the other.
**(Working on this Jan 30th)... Need input from Susi to continue. Time spent today: 4 hours (NEW TIME LINE: needs 6 hours, to be completed by Feb**
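The guide/loop steps for Aim 3 could be sketched roughly as follows. This is a minimal illustration, not the script actually used. Assumptions: the "less than 10 total" filter is applied to the combined guide + loop reads of a miRNA, a pseudocount of 1 is added to avoid dividing by zero, the first three lanes are 1224 (KD) and the rest are 1309, and Welch's t statistic stands in for "T-test or similar". All names and numbers are hypothetical.

```python
import math
from statistics import mean, variance


def pick_guide(arm_5p, arm_3p):
    """Return whichever arm (per-lane read list) has more reads in total."""
    return arm_5p if sum(arm_5p) >= sum(arm_3p) else arm_3p


def guide_loop_ratios(guide, loop, pseudo=1.0):
    """Per-lane guide/loop ratio; the pseudocount is an assumption."""
    return [(g + pseudo) / (l + pseudo) for g, l in zip(guide, loop)]


def welch_t(a, b):
    """Welch's t statistic (unequal variances) for two groups of lanes."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def compare(mirna, n_kd=3, min_total=10):
    """Compare guide/loop ratios between KD lanes and muP53 lanes."""
    guide = pick_guide(mirna["5p"], mirna["3p"])
    loop = mirna["loop"]
    if sum(guide) + sum(loop) < min_total:
        return None  # too few reads to say anything
    ratios = guide_loop_ratios(guide, loop)
    kd, mu = ratios[:n_kd], ratios[n_kd:]
    return {"kd_mean": mean(kd), "mu_mean": mean(mu), "t": welch_t(kd, mu)}


# Hypothetical miRNA: lanes 1-3 are 1224 (KD), lanes 4-6 are 1309 (muP53).
example = {
    "5p": [200, 220, 210, 60, 50, 70],
    "3p": [10, 12, 8, 5, 4, 6],
    "loop": [100, 110, 105, 120, 100, 140],
}
result = compare(example)
low = compare({"5p": [1, 0, 1, 0, 0, 1], "3p": [0] * 6, "loop": [1, 1, 0, 0, 1, 0]})
print(result, low)
```

With only three lanes per condition the t statistic is a rough guide at best; a p-value would need the Welch degrees of freedom and a t distribution, which are left out here.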

Aim 4) Long RNA-Seq analysis. HEAT map of all differentially expressed
 * Do the differential expression analysis for both conditions on mRNA (long RNA)
 * KD vs MuP53: just simple diff expression
 * I need to create the read count file from the output files
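Once the read count file exists, the simple KD vs MuP53 comparison could start from something like the sketch below, which ranks genes by log2 fold change. This is not a statistical test and does no library-size normalization; a dedicated package such as DESeq or edgeR would be the usual choice for the real analysis. The column layout (three KD samples followed by three muP53 samples), the pseudocount, and the gene names are assumptions for illustration.

```python
import math
from statistics import mean


def log2_fold_changes(counts, kd_cols, mu_cols, pseudo=1.0):
    """Rank genes by |log2(muP53 / KD)| from a {gene: [reads per sample]} table."""
    ranked = []
    for gene, row in counts.items():
        kd = mean([row[i] for i in kd_cols])
        mu = mean([row[i] for i in mu_cols])
        # Pseudocount (an assumption) keeps genes with zero reads finite.
        ranked.append((gene, math.log2((mu + pseudo) / (kd + pseudo))))
    return sorted(ranked, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical table: columns 0-2 are 1224 (KD), columns 3-5 are 1309 (muP53).
counts = {
    "GeneA": [100, 110, 90, 400, 420, 380],
    "GeneB": [50, 55, 45, 52, 48, 50],
}
ranked = log2_fold_changes(counts, kd_cols=[0, 1, 2], mu_cols=[3, 4, 5])
for gene, lfc in ranked:
    print(gene, round(lfc, 2))
```

Positive values mean higher expression in muP53 than in KD; the top of the list gives candidates for the heat map.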

Note from Susi's email: I have also attached the raw data of my iTRAQ experiment. It was run with the same samples from the RNA seq experiment. However, this time I only have two replicates and two different time points for each replicate (2d and 4d after knockdown of mut p53).

Part 2) Quality Control Analysis for the already aligned reads and the aligned read files. **Estimated completion date (Feb st)**

Part 3) ITRAQ protein data analysis: This has not been done before by anyone for this kind of data. Novel analysis. Possible first author publication as suggested by Susi. **Estimated completion date: N/A; depends on when I get data, but will take at least 2 months after that**.
 * Since I had not done the initial analysis of this for Susi, I need to do a QC analysis for each sample and find out how reliable the data is. **(8 hours)**
 * I require the original NON-normalized data to start this analysis. Susi needs to get back to me about this data.
 * Figure out how the analysis is done and do the analysis (56 hours)
 * I need non normalized data before proceeding on the analysis of this data.
 * Data link sent by Susi: Will download it on **Feb 1st 10:00 am**.
 * Meeting with Susi on **Wednesday Feb 1st 9:00 am**. TIME SPENT: 1 hour 15 mins. To discuss the current project and clarify two issues:
 * FPKM vs Read counts: We decided that the read counts are OK!
 * We do, however, need to check the quality of the data to figure out how good the experiment is because seems like there is some noise.
 * What to do with the loop vs guide ratio, since guide has both 5p and 3p values. Which one to use? USE BOTH.
 * **TO BE DELIVERED TODAY Feb 1st:**
 * **Initial QC of data (5 hours) : Miscalculated the time. New time required(12 hours) New Due Date, Feb 9th. There are following ISSUES:**
 * **We do not have the raw read files. Need to get those from Tom's lab:** Today, Feb 2nd, 2012: Update: We now have the files.
 * You can find the fastq files here /data/gingeras/scratch/zaleski/mouse_trin_fastq/
 * **This will be fed into FASTX toolkit which can generate the statistics for the RAW reads per experiment.**
 * **In order to figure out replicate correlation, we need to see if the bam file has mismatched reads or not. We will do this in Samtools.**
 * **Once we have the correct BAM file, we will use a GTF (which we have to create for miRNA's)**
 * **Once we have the GTF, we will take the BAM--> SAM file and use that with the GTF to get a table of read counts for each miRNA**
 * **This table can then be used for further correlation analysis or PCA etc.**
 * **Heat map scale (1 hour): Total time taken: 5 hours. This took longer than expected. We needed to rerun the analysis without the miRNA's with < 5 reads per lane.**
 * **Try to finish the guide loop (but I am not hopeful that I will have time for that)**
 * Feb 2nd 2012: Susi asked for a sorted list of differentially expressed miRNAs in both guide and loop. I sent the list to her. Time required: 30 mins
 * For Susi to do:
 * Try the DAVID website for pathway analysis
 * Problem with Ingenuity: it does not recognise microRNA names. Look at the manual for miRNA analysis
 * **FUTURE PLAN For Itraq Data: To be completed by: Feb 28th (4 weeks)**
 * SUSI: Clean up the data as much as you can
 * Remove any rows that have dashes for BOTH excel files.
 * Send both clean excel files
 * For EACH miRNA that has come up as significant, we will pull out the Proteins from TargetScan website.
 * Then we will take the intersection of these proteins with the ITRAQ data in either direction (up/down in muP53/KD) and see if we find any overlap.
 * Vishal:
 * Take the clean files. Sort the data on the MATCH column and remove further the rows that have less than 10% match. (Suggested by Susi on Darril's recommendation)
 * Take the ratio of the columns of interest, (A/b)/(C/b), to get the ratio A/C.
 * Also need to Look and FIND out about ratio statistical significance.
 * NEXT MEETING: FEB 7th 2012 9:30 am
 * For EACH miRNA that has come up as significant, we will pull out the Proteins from TargetScan website.
 * Then we will take the intersection of these proteins with the ITRAQ data in either direction (up/down in muP53/KD) and see if we find any overlap.
 * Meeting Feb 8th at 10:00 am.
 * EMAIL correspondence:
 * I have uploaded the excel spreadsheet as discussed yesterday to the server (Vishal->For Susi->miRNA and iTRAQ). The first spreadsheet contains the predicted mRNA targets for each miRNA. The second spreadsheet contains the iTRAQ data. I guess I need to explain how the latter is set up - The spreadsheet contains 2 files, in which the data has been normalized to different replicates. We need to analyze the "4d" samples only.
 * MEETING NOTES: Feb 8th 2012, time 10:10 am
 * Our hypothesis is that when miRNA is down, the protein levels for the protein the miRNA targets should be UP.
 * or if the miRNA is UP, then the protein it targets is DOWN.
 * We need to test this with our miRNA data and ITRAQ protein data. This is how we will do it:
 * We have 2 files.
 * ITRAQ proteins and their values for 1309 day 4 (both reps) and 1224 (KD) day 4 rep 5
 * 2nd part, 1309 both reps and 1224 day4 rep12 This will be a separate analysis from above. Do all the below steps for these both.
 * We have a miRNA prediction file from TARGETSCAN. This contains GeneID for miRNAs that are UP or Down in muP53 compared to KD
 * We need to combine the 2 protein files to get a file that has all proteins. (4 hours) **DONE Feb 8th.**
 * We will now combine the 2 data sets on the GENE ID field so that we get all the proteins that are expressed and are common between miRNA predicted genes and ITRAQ gene--protein dataset ( 4-5 hours) (For one, for second, 1 hour)
 * Next, once we have the common genes / proteins, we will take the ratio of 1309/4d with 1224 day 4/rep5 for both replicates of 1309. (4 hours)
 * According to our hypothesis above, we should see that miRNAs that are up should have proteins whose ratio is down and vice versa. Let's see!
 * Work Progress: Monday 2/13/2012
 * Create a Long RNA file from FPKM files (1 hour) Took 2 hours (The Files are SO large, took a LONG time to load into memory and copy)
 * Work Progress Wednesday 2/15
 * Created the NEW file with combined read counts for all transcripts. Took 4 hours. Re-running the program.
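The miRNA vs iTRAQ comparison planned in the Feb 8th meeting notes (join on gene ID, take the 1309/1224 ratio, check the direction against the miRNA) could be sketched like this. It is a simplified illustration under assumptions: each gene ID carries a single miRNA direction (in reality a gene may be targeted by several miRNAs), each protein has one value per condition, and the gene IDs and values are made up.

```python
def overlap_and_check(mirna_targets, itraq):
    """Join predicted targets with iTRAQ proteins and test the hypothesis.

    mirna_targets: {gene_id: "up" or "down"}, miRNA direction in muP53 vs KD
    itraq:         {gene_id: (value_1309_4d, value_1224_4d)}
    Returns rows of (gene_id, mirna_direction, protein_ratio, consistent).
    """
    rows = []
    # Only genes present in BOTH data sets are kept (the overlap).
    for gene_id in sorted(mirna_targets.keys() & itraq.keys()):
        v1309, v1224 = itraq[gene_id]
        ratio = v1309 / v1224  # muP53 over KD
        if ratio > 1:
            protein_dir = "up"
        elif ratio < 1:
            protein_dir = "down"
        else:
            protein_dir = "flat"
        mirna_dir = mirna_targets[gene_id]
        # Hypothesis: miRNA down -> protein up, miRNA up -> protein down.
        consistent = (mirna_dir, protein_dir) in {("up", "down"), ("down", "up")}
        rows.append((gene_id, mirna_dir, ratio, consistent))
    return rows


# Hypothetical inputs; G3 and G4 each appear in only one data set.
targets = {"G1": "down", "G2": "up", "G3": "down"}
proteins = {"G1": (2.0, 1.0), "G2": (0.5, 1.0), "G4": (1.2, 1.0)}
rows = overlap_and_check(targets, proteins)
for row in rows:
    print(row)
```

The same function would be run once per iTRAQ file (rep 5 and rep 12), since the notes treat those as separate analyses. The significance of the ratios still has to be worked out separately.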