Winter 2018: Introduction to Bioinformatics and Computational Biology

Schedule: Tuesday and Thursday 9:30-10:50am

Location: TTIC Room 501 on the 5th floor, 6045 S Kenwood Ave, Chicago, IL 60637

Instructor: Jinbo Xu (jinboxu@gmail.com, office: TTIC Room 528)

Students can register for this course through the University of Chicago.

With the availability of large-scale genomic, expression and structural data, mathematics, statistics and computer science are being used extensively to understand biology at the molecular level. This course focuses on the application of machine learning and computer algorithms to problems in molecular biology. In particular, it covers some fundamental computational molecular biology problems, including sequence alignment, homology search, RNA/protein structure analysis and prediction, gene expression, biological network analysis and next-generation sequencing.

Students are strongly encouraged to read the following materials before attending this class, since they will not be covered in class.

1. The Department of Energy's Primer on Molecular Genetics.

2. The Department of Energy's Overview of the Human Genome Project.

3. Hunter's molecular biology for computer scientists.

4. National New Biology Initiative: A New Biology for the 21st Century.

Syllabus

Here is a syllabus for this course. A temporary reading list is available here.

Intended Audience

Graduate students or senior undergraduate students with a math, CS, statistics or biology background. To finish the assignments and the final research project, students will need to do some programming in C++, Java, Matlab, Python, R or other scientific computing software.

Evaluation

There will be no examination for this course. The final grade is derived from three components: homework, one final research project and attendance. For each homework assignment, you will re-implement a popular algorithm or conduct an experiment to compare several popular bioinformatics tools, and summarize your work in a technical report (around 5 pages). There will be four homework assignments, each accounting for 15% of the final grade (60% in total). The final research project requires you to develop a new algorithm or write a critical review for a specific bioinformatics problem. You are not required to come up with extremely innovative ideas, although doing so is highly encouraged. Incremental improvement over existing algorithms is acceptable for the final research project. Please hand in a report on the final research project, which accounts for 30% of the final grade. All students are required to finish both the homework and the final research project. However, undergraduate students will be marked more generously. Students have to attend class to earn the remaining 10%.

Homework Assignments

Homework 1 (due date Jan 21, 2018):

a. Implement the local and global alignment models for pairwise DNA sequence alignment in the programming language you know best. DO NOT copy from open-source code.

b. Test your code using one of the four benchmarks (BaliBASE, OXBench, PREFAB or SMART) at http://dna.cs.byu.edu/mdsas/download.shtml, which has the ground truth alignments. Each file in the benchmark contains one multiple sequence alignment, from which you may extract pairwise alignments. The input sequences for your program shall not contain any gaps (i.e., “-” in the ground truth alignment). Calculate the alignment accuracy of your code by comparing its resultant alignments with the ground truth.

c. Compare the alignment accuracy of the local and global alignment models.

d. Study the impact of gap penalty. You may test two different gap extension penalty scores (0 and -1) and three different gap open penalty scores (0, -5, and -10).
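
The accuracy calculation in step b can be sketched as follows. This is a minimal illustration, not a required implementation. It assumes that "-" is the only gap character and that accuracy is defined as the fraction of ground-truth aligned residue pairs that your alignment recovers, which is one common definition among several. The function names are hypothetical.

```python
def aligned_pairs(row_a, row_b):
    """Return the set of (i, j) residue index pairs aligned in two gapped rows.

    i and j are 0-based positions in the ungapped sequences. Columns where
    either row has a gap contribute no pair, so two rows taken straight out
    of a multiple sequence alignment can be passed in unchanged.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    pairs = set()
    i = j = 0
    for a, b in zip(row_a, row_b):
        if a != "-" and b != "-":
            pairs.add((i, j))
        if a != "-":
            i += 1
        if b != "-":
            j += 1
    return pairs


def ungap(row):
    """Strip gaps to obtain the input sequence for the alignment program."""
    return row.replace("-", "")


def alignment_accuracy(pred_a, pred_b, true_a, true_b):
    """Fraction of ground-truth aligned pairs that the prediction recovers."""
    truth = aligned_pairs(true_a, true_b)
    if not truth:
        return 0.0
    predicted = aligned_pairs(pred_a, pred_b)
    return len(truth & predicted) / len(truth)
```

For example, `alignment_accuracy("ACG-T", "ACTGT", "AC-GT", "ACTGT")` compares a predicted alignment against a ground-truth alignment of the same two sequences and returns 0.75, because three of the four true residue pairs are recovered.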

Homework 2 (due date Jan 31, 2018):

Develop scoring matrices for amino acid similarity based upon the reference alignments in the SABmark1.65 datasets, using the same method by which the BLOSUM matrices are derived.

1) The SABmark benchmark contains two datasets, one for superfamily level and the other for twilight zone. Please derive two different scoring matrices, one from each dataset.

2) Compare your two scoring matrices and explain their similarities and differences.

3) Compare your two scoring matrices with the five BLOSUM matrices listed at https://www.ncbi.nlm.nih.gov/IEB/ToolBox/C_DOC/lxr/source/data/. Find out which BLOSUM matrices are the closest to your scoring matrices.
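
The core log-odds step of a BLOSUM-style derivation can be sketched as below. This is a simplified illustration under stated assumptions: it takes pairwise alignments as (row, row) tuples with "-" as the gap character, it omits the clustering of similar sequences that the original BLOSUM procedure applies before counting, and it leaves out any residue pair that was never observed. The function names and the `scale` parameter (2.0 gives half-bit units) are hypothetical.

```python
import math
from collections import Counter


def count_pairs(alignments):
    """Count unordered residue pairs over all gap-free alignment columns."""
    counts = Counter()
    for row_a, row_b in alignments:
        for a, b in zip(row_a, row_b):
            if a != "-" and b != "-":
                counts[tuple(sorted((a, b)))] += 1
    return counts


def log_odds_matrix(counts, scale=2.0):
    """Convert pair counts into rounded log-odds scores.

    q[(a, b)] is the observed frequency of the unordered pair (a, b).
    p[a] is the background frequency of residue a implied by those pairs.
    The expected pair frequency is p[a]^2 for identical residues and
    2 * p[a] * p[b] otherwise; the score is scale * log2(observed / expected).
    """
    total = sum(counts.values())
    q = {pair: n / total for pair, n in counts.items()}
    p = Counter()
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2
    scores = {}
    for (a, b), freq in q.items():
        expected = p[a] * p[a] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(scale * math.log2(freq / expected))
    return scores
```

Running `log_odds_matrix(count_pairs(alignments))` once per SABmark dataset would give the two matrices asked for in item 1, keyed by alphabetically sorted residue pairs.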

Homework 3 (due date Feb 15, 2018):

a. Please download the following protein sequences:

1) http://www.predictioncenter.org/casp12/target.cgi?target=T0859&view=sequence

2) http://www.predictioncenter.org/casp12/target.cgi?target=T0860&view=sequence

3) http://www.predictioncenter.org/casp12/target.cgi?target=T0861&view=sequence

4) http://www.predictioncenter.org/casp12/target.cgi?target=T0862&view=sequence

5) http://www.predictioncenter.org/casp12/target.cgi?target=T0863&view=sequence

b. Go to the NCBI BLAST web site and run Protein BLAST on each of the above sequences using different scoring matrices and word sizes.

c. In your homework report, for each sequence please study the similarity and dissimilarity of the results returned by BLAST when the scoring matrix and word size are changed and summarize your findings. Please report how the E-value distribution changes with respect to the scoring matrix and word size.

Homework 4 (due date Feb 22, 2018):

a) Learn to use two protein secondary structure prediction web servers: 1) SPIDER2: http://sparks-lab.org/server/SPIDER2/; and 2) RaptorX-Property: http://raptorx.uchicago.edu/StructurePropertyPred/predict/ .

b) To test them, please download the sequence and structure files (in FASTA and PDB format respectively) of the following proteins from the Protein Data Bank website: 4ympA, 5a7dB, 5a7dL, 5aotA, 5ereA, 5fjlA, 5j4aA, 5j4aB, 5j5vA, 5j5vB, 5j5vC, 5jmbA, 5jmuA, 5kkpA and 5ko9A. In each name, the first four characters are the PDB ID and the last letter is the protein chain name. For example, by searching 4ymp at http://www.rcsb.org/pdb/home/home.do, you will be directed to the web page of the protein 4ymp. By clicking on the “Download” button, you will be able to download the FASTA sequence and the PDB files. Note that for 4ympA you only need the sequence of chain A.

c) Submit your sequences to the web servers and then calculate the secondary structure prediction accuracy (i.e., Q3 accuracy) of the returned results. To obtain the ground truth, you may use the DSSP web server or program at http://www.cmbi.ru.nl/dssp.html. You will need to write a small program to parse the output of DSSP.

d) Compare your Q3 accuracy with that listed in Table 1 of the paper at https://academic.oup.com/bib/article/doi/10.1093/bib/bbw129/2769436/Sixty-five-years-of-the-long-march-in-protein to see how different your result is from this table. Only Q3 accuracy is needed.

e) If you do not want to use the web servers, you can of course install their standalone packages. Generally speaking, however, installing a local copy of the packages takes much more effort.
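
The Q3 calculation in step c can be sketched as follows. This is a minimal illustration that assumes the DSSP states have already been parsed into a string with one character per residue, in the same residue order as the prediction. The 8-state to 3-state reduction used here (H, G, I to helix; E, B to strand; everything else to coil) is one common convention and an assumption on my part; other reductions exist, so check which one the method you compare against uses. The function names are hypothetical.

```python
# Assumed reduction of DSSP's 8 states to 3 states; any state not listed
# here (T, S, blank, and so on) is treated as coil "C".
DSSP_TO_3 = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def reduce_to_three_states(dssp_string):
    """Map a per-residue DSSP state string to the 3-state alphabet H/E/C."""
    return "".join(DSSP_TO_3.get(state, "C") for state in dssp_string)


def q3_accuracy(predicted, observed):
    """Fraction of residues whose predicted 3-state label matches the truth."""
    if len(predicted) != len(observed):
        raise ValueError("prediction and ground truth must have equal length")
    if not observed:
        raise ValueError("empty sequences have no defined Q3")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(observed)
```

A typical call would be `q3_accuracy(server_prediction, reduce_to_three_states(dssp_states))`. Residues that appear in the FASTA sequence but are missing from the PDB structure have no DSSP state, so they need to be excluded from both strings before calling it.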

Research Projects

Below are some optional topics for the final research project. You are also encouraged to propose your own topic, but it must be different from the homework topics. If your project is about algorithm development, I expect you to achieve some incremental improvement over currently popular algorithms. If your project is a review of a specific topic, please discuss future research directions in addition to critically reviewing existing work.

1. Review algorithms for protein secondary structure prediction

2. Review deep learning algorithms for protein local structure (secondary structure, solvent accessibility and torsion angles) prediction

3. Review algorithms for protein contact prediction

4. Review deep learning methods for drug discovery

5. Review algorithms for protein structure alignment

6. Review algorithms for phylogeny tree construction

7. Review algorithms for biological network alignment

8. Review algorithms for biological network construction

9. Develop algorithms for pairwise or multiple protein-protein interaction network alignment

10. Develop deep learning algorithms for protein secondary structure and/or torsion angle prediction

11. Develop deep learning algorithms for protein solvent accessibility and/or contact number prediction

12. Develop deep learning algorithms to predict whether a protein is a DNA- or RNA-binding protein

13. Develop deep learning algorithms to predict whether two proteins interact

The due date of the final project is March 11, 2018. Please send me a brief abstract (one paragraph) describing what you want to work on before mid-February 2018. If you need your final grade to graduate, please talk to me and hand in your final project and homework earlier (by March 4, 2018). If you need more time to complete the research project, please also talk to me.