Scott

**Projects and timeline:** ######### Hours represent actual work time, not time spent at the workplace #######
 * RNA-Seq analysis //already// done for Zhen is available at: http://thapar.wikispaces.com/Zhen
 * Work remaining overlaps with the points below. Estimated completion date for her to have candidates for validation: Feb 1st
 * RNA-Seq analysis to be done again in R for double-checking, along with creation of a command-line pipeline for all RNA-Seq analysis for the Lowe Lab. Estimated time of completion: March 31st 2012. This pipeline will be run from the shell in Unix.
 * Step 1) Take the initial files aligned with Tophat and then, using programs like HTSeq and Cufflinks, find the read counts in each replicate. Make sure that the same GTF file is used for both.
 * GTF file can be downloaded from the UCSC Table Browser (1 hour)
 * Time that it will take to write scripts: 4 hours
 * Time required to install HTSeq / Python libraries, dependencies, and programs needed to run the entire pipeline on Mac as well as on the Linux server in parallel, plus data transfer to the server and getting the directory ready (48 hours)
 * Time for Tophat, Cufflinks install: 2-5 hours (dependencies must be installed)
 * Time required for actual running: 24 hours
 * Step 2) To compare the results between edgeR, DESeq and Cuffdiff, run the differential expression analysis with all 3 to pick which one is best.
 * Write scripts in R for edgeR, DESeq (2 hours)
 * Write shell scripts to run Cuffdiff (2 hours)
 * Run the two analyses (2 hours)
 * Do a comparison analysis between the 3 programs with their outputs, generate all graphs and overlaps of significant genes (16 hours)
 * Step 3) Once the comparison is done, combine all the programs for each step into 2-3 small programs that will handle the entire analysis
 * Write a MAIN shell script that calls all other shell scripts.
 * Convert each R script to run from the shell, and combine them into a unified R script (24 hours)
 * Step 4) Test the written pipeline for accuracy (Testing)
 * Take the entire pipeline and run it to see how well it runs and how much time it takes. (24 hours)
 * Compare the results to actual results and verify (4 hours)
 * Susi Data analysis : Details at : http://thapar.wikispaces.com/Susi
 * Part 1) Sequencing Data: Estimated Completion Date Feb 9th (Lab meeting for Susi)
 * Analyze the current data for 4 specific aims.
 * Aim 1) We need to take just the GUIDE miRNA and compare between the two conditions to see which ones are most expressed, with significance. Heatmap to show the difference. (4 hours) (DONE)
 * Aim 2) Loop gives the expression values of the pre-miRNA. Do the same analysis as above with Loop. (2 hours) (DONE)
 * Aim 3) The ratio of GUIDE / Loop tells you whether DICER is working. Remove read lanes with fewer than 10 reads in total. Then, for guide, take one row between 3p and 5p: the row that has the maximum reads mapped. Take the ratio guide/loop and then find the difference in the ratios between the two conditions. If the ratio is close to 1, the transcript is being processed; if the ratio is <<1, there is no processing, because loop is high and guide is low. Heatmap to cluster miRNAs on ratios. T-test or similar test to see how each condition differs from the other. (16 hours)
 * Aim 4) Long RNA-Seq Analysis of mRNA data: This needs to be clarified and discussed. Estimated time required (8 hours)
 * Part 2) Quality Control Analysis for the already aligned reads and the aligned read files. Estimated completion date (Feb 9th)
 * Since I did not do the initial analysis of this data for Susi, I need to do a QC analysis for each sample and find out how reliable the data is. (8 hours)
 * Part 3) iTRAQ protein data analysis: This has not been done before by anyone for this kind of data. Novel analysis. Possible first-author publication, as suggested by Susi. Estimated completion date: N/A; depends on when I get the data, but will take at least 2 months after that.
 * I require the original NON-normalized data to start this analysis. Susi needs to get back to me about this data.
 * Figure out how the analysis is done and do the analysis (56 hours)
 * UPDATE SO FAR: Today is **Jan 31st 2012**. We begin the initial part of this analysis **on Feb 1st**.
 * Data is available to download from CSHL. Download data, time required: **15 mins. Due date: Feb 1st.**
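
The guide/loop ratio described in Aim 3 can be sketched roughly as follows. This is only an illustration under stated assumptions: the data layout (one `(5p, 3p, loop)` tuple per miRNA), the miRNA names, the counts, and the reading of "fewer than 10 reads in total" as guide plus loop reads are all hypothetical, and the heatmap and t-test steps are not shown.

```python
def guide_loop_ratio(guide_5p, guide_3p, loop, min_total=10):
    """Return guide/loop for one miRNA in one condition, or None if filtered out.

    Assumption: the filter applies to guide + loop reads combined.
    """
    guide = max(guide_5p, guide_3p)  # keep the arm with the most mapped reads
    if guide + loop < min_total or loop == 0:
        return None  # too few reads, or ratio undefined
    return guide / loop


def ratio_differences(cond_a, cond_b):
    """Difference in guide/loop ratio (A minus B) for miRNAs that pass the filter in both."""
    diffs = {}
    for mirna in cond_a.keys() & cond_b.keys():
        ratio_a = guide_loop_ratio(*cond_a[mirna])
        ratio_b = guide_loop_ratio(*cond_b[mirna])
        if ratio_a is not None and ratio_b is not None:
            diffs[mirna] = ratio_a - ratio_b
    return diffs


# Hypothetical counts: miRNA -> (guide 5p reads, guide 3p reads, loop reads)
cond_a = {"miR-x": (120, 4, 100), "miR-y": (3, 2, 1), "miR-z": (5, 0, 200)}
cond_b = {"miR-x": (30, 2, 120), "miR-y": (40, 1, 50), "miR-z": (4, 1, 180)}

diffs = ratio_differences(cond_a, cond_b)
for mirna in sorted(diffs):
    print(mirna, round(diffs[mirna], 4))
```

With real data there would be several replicates per condition, so the per-replicate ratios (rather than a single difference) would feed the t-test or similar test.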


 * Agustin data analysis for the EZh2 story: Status: **Completed**. Time spent (160 hours). Please see details at: http://thapar.wikispaces.com/ezh2
 * What has been done so far:
 * Microarray analysis:
 * Scripts in R
 * Get a list of differentially expressed genes
 * ChIP-Seq analysis:
 * Quality Control of data
 * Alignment of reads using Bowtie
 * Statistical analysis with 2 different approaches
 * Using MACS: published and well established as a peak-calling package for ChIP-Seq
 * Using edgeR: R based package for differential expression
 * Combine both results to look for overlap
 * Presentation of results by making graphs
 * Pathway analysis using DAVID
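
The "combine both results to look for overlap" step above could look something like the sketch below. It assumes both result sets have already been reduced to lists of gene identifiers (for MACS that would mean mapping peaks to genes first, which is not shown); the gene names are placeholders, not real results.

```python
def overlap_summary(genes_a, genes_b):
    """Summarize the overlap between two lists of gene identifiers."""
    a, b = set(genes_a), set(genes_b)
    shared = a & b
    union = a | b
    return {
        "shared": sorted(shared),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
        # Jaccard index: size of the overlap relative to the combined list
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


# Placeholder gene lists standing in for the two sets of results
macs_genes = ["gene1", "gene2", "gene3", "gene4"]
edger_genes = ["gene3", "gene4", "gene5"]

summary = overlap_summary(macs_genes, edger_genes)
print(summary["shared"], summary["jaccard"])
```

The same function could be applied pairwise to compare more than two result lists.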


 * Galaxy: Ongoing project with MSKCC Bioinformatics: Nick Socci, Yupu Liang, Joanne Edington.
 * Create microarray analysis support on Galaxy (Project Completed). Time spent: 54 hours, mostly on meetings and problem solving with Bioinformatics
 * As of now Galaxy is available for use for Lowe Lab Members at http://galaxy.cbio.mskcc.org/
 * Create Sequencing Data analysis support on Galaxy:
 * While I do NOT recommend doing sequence analysis on Galaxy for various reasons, there is some basic support for analysis of Next Gen data on Galaxy.
 * Support is there for Quality control of data and some initial FASTQ analysis
 * To be added: Tophat, Cufflinks, Cuffdiff, edgeR, DESeq installation on Galaxy. Need to work and coordinate with the Bioinformatics Shared Resource (BSR) for this. Time: (10 hours meeting time)
 * Coding / installation time (80 hours)
 * shRNA prediction (time : 4 hours on weekends, 2 hours per week)
 * Vishal's current algorithm works with an accuracy of about 40%. Claudio's test reported an accuracy of 60% (3/5 hairpins had good knockdown). For biological evaluation, I have predicted hairpins with my algorithm for 6 genes that Luke had sent. The hairpins will be designed for these along with Simon's predictions for a combined test.
 * A new viewpoint on this algorithm came out of discussions with Raphael from Christina Leslie's lab. A new way to normalize the data has led to good initial results, with an accuracy of 70% in the initial tests. I am working closely on this to have an algorithm ready for the Lowe lab very soon. We might even be able to generate hairpins for the above genes for the testing that is about to go out next week. (Today is Jan 28th, 2012)
 * Write a script to convert 22mer GUIDE strand sequence into 97mer sequence for the VECTOR. **Time required 3-4 hours. Due date: Feb 1st. Anticipated completion date, Jan 31st.**
 * Data mining for TCGA data (with Reileen, postdoc from Chris Sander's lab). Estimated time of completion: Feb 14th for INITIAL PROTOTYPE data set. Total time per week: 4 hours
 * I have managed to acquire segmented data from Reileen for 200 samples from the TCGA data. (Time spent: 4 hours over lunch time in 4 weeks)
 * Next step is to run data mining algorithms on this data to come up with co-occurrence statistics. (Estimated time required: 24 hours)
 * Finally, after presenting this data to Scott and Chris and getting approval from both, we can go ahead and do similar runs for all data sets in TCGA. (Estimated time required: 80 hours)
 * Coordinating IT-related stuff:
 * Backup of CSHL data. The data has finally arrived. Will be transferred when IT at MSKCC takes over from me. Meetings / email time spent: 16 hours over the last 6 months
 * Updating and preparing this document on a daily basis: Time required (1 hour/day)
 * Meeting / Emailing with lab members to provide project updates: 2-3 hours / week
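
For the TCGA co-occurrence statistics mentioned above, one possible starting point is a one-sided Fisher's exact test on whether two alterations appear in the same samples more often than expected by chance. This is a sketch under assumptions, not the chosen method: the test itself, the sample IDs, and the idea that segmented data has already been turned into "altered / not altered" calls per sample are all illustrative.

```python
from math import comb


def cooccurrence_pvalue(altered_a, altered_b, n_samples):
    """One-sided Fisher's exact test for co-occurrence of two alterations.

    altered_a, altered_b: sample IDs carrying alteration A and alteration B.
    n_samples: total number of samples examined.
    Returns (number of samples with both, P(overlap >= observed) under independence).
    """
    a, b = set(altered_a), set(altered_b)
    both = len(a & b)
    na, nb = len(a), len(b)
    # Hypergeometric tail: count ways to place B's nb samples with k inside A's na
    total = comb(n_samples, nb)
    tail = sum(
        comb(na, k) * comb(n_samples - na, nb - k)
        for k in range(both, min(na, nb) + 1)
    )
    return both, tail / total


# Hypothetical calls for 10 samples (IDs 0..9)
amp_samples = [0, 1, 2, 3]   # samples with an amplification of interest
del_samples = [0, 1, 2, 3]   # samples with a deletion of interest

both, p_value = cooccurrence_pvalue(amp_samples, del_samples, n_samples=10)
print(both, p_value)
```

A small p-value only indicates correlation across samples; it does not by itself show that the co-occurrence is causative, and testing many pairs would need multiple-testing correction.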

############

1. EZh2 as a tumor suppressor 2. shEzh2 transgenic

Talk to Alex Krasnitz about associations that he worked on with Jose: same chromosome, point mutations shouldn't matter but

Meeting with Chris Sander: What we have tried to do - Scott
 * 1) co-occurrence of amplifications and deletions
 * 2) causative co-occurrence of genes not just correlated

They have done, tools

Remind on lab meeting about wiki

Talk to Ozlem