Winter 2017: Topics in Bioinformatics

Schedule: Monday and Wednesday 9-10:20am

Location: TTI-C Room 526 on the 5th floor, 6045 S Kenwood Ave, Chicago, IL 60637

Instructor: Jinbo Xu (jinboxu@gmail.com, office: TTI-C room 528)

 

Students can register for this course through the University of Chicago.

 

 

With the availability of large-scale genomic, expression and structural data, mathematics, statistics and computer science are being used extensively to understand biological data at the molecular level. This course focuses on the application of machine learning and computer algorithms to problems in molecular biology. In particular, it will cover topics such as sequence alignment and homology search, RNA/protein structure prediction and biological network analysis.

 

Students are strongly encouraged to read the following materials before attending this class, since they will not be covered in class.

1. The Department of Energy's Primer on Molecular Genetics.

2. The Department of Energy's Overview of the Human Genome Project.

3. Hunter's molecular biology for computer scientists.

4. National New Biology Initiative: A New Biology for the 21st Century.

Syllabus

Here is a syllabus for this course. A temporary reading list is available here.

Intended Audience

 

Graduate students or senior undergraduate students with some Math/CS/statistics/biology background. To finish the assignments and the final research project, students will need to do some programming in C++, Java, Matlab, Python, R or other scientific computing software.

Evaluation

 

There will be no examination for this course. The final grade consists of three components: homework, one final research project and attendance. For the homework assignments, you may re-implement a popular algorithm or conduct an experiment to compare several popular bioinformatics tools, and summarize your work in a technical report (around 5 pages). The homework assignments account for 60% of the final grade. For the final research project, you may develop new algorithms for a specific bioinformatics problem or conduct a comprehensive review of a specific research topic, such as deep learning for protein structure prediction. You are not required to come up with extremely innovative ideas, although doing so is highly encouraged. Incremental improvement over existing algorithms is acceptable for the final research project. Please hand in a report on the final research project. The final project accounts for 30% of the final grade. All students are required to finish both the homework and the final research project. However, undergraduate students will be marked more generously. Students must attend class to earn the remaining 10%.

Homework Assignments (to be updated)

1.      Homework 1:

a.       Please implement the three alignment models taught in class using whatever programming language you are good at. You may compare your code with open-source implementations, but please do not copy from them. You only need to do pairwise alignment.

b.      Please test your algorithms using some data from the SABmark benchmark. Please calculate the alignment accuracy by comparing your alignments with the ground truth in the benchmark. SABmark contains proteins that are similar at the superfamily level and proteins in the twilight zone. Please test your algorithms using proteins in both categories.

c.       Please study the impact of different alignment models.

d.      Please study the impact of different scoring matrices (BLOSUM and PAM).

e.       You may fix the gap extension penalty at -1 and search for the best gap open penalty.
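The accuracy calculation in item b can be sketched as follows. This is only a minimal illustration, assuming that both your alignment and the reference alignment are available as two gapped strings with '-' as the gap character, and that accuracy means the fraction of reference aligned residue pairs that your alignment recovers. The function names are hypothetical, and the benchmark's own file format will need its own parser.

```python
def aligned_pairs(row_a, row_b, gap="-"):
    """Return the set of (i, j) residue index pairs aligned to each other.

    row_a and row_b are the two gapped rows of one pairwise alignment.
    Indices refer to positions in the ungapped sequences.
    """
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    pairs = set()
    i = j = 0  # current positions in the ungapped sequences
    for a, b in zip(row_a, row_b):
        if a != gap and b != gap:
            pairs.add((i, j))
        if a != gap:
            i += 1
        if b != gap:
            j += 1
    return pairs


def alignment_accuracy(pred_a, pred_b, ref_a, ref_b):
    """Fraction of reference aligned pairs found in the predicted alignment."""
    ref = aligned_pairs(ref_a, ref_b)
    if not ref:
        return 0.0
    pred = aligned_pairs(pred_a, pred_b)
    return len(ref & pred) / len(ref)


# Toy example (made-up sequences, not SABmark data):
# the reference aligns 4 residue pairs and the prediction recovers 3 of them.
print(alignment_accuracy("ACDEF", "ACEGF", "ACDE-F", "AC-EGF"))  # 0.75
```

Other definitions of accuracy are possible (for example, dividing by the number of predicted pairs instead), so please state in your report which one you used.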

 

 

2.      Homework 2:

a.       Please download the following protein sequences:

1)      http://www.predictioncenter.org/casp12/target.cgi?target=T0859&view=sequence

2)      http://www.predictioncenter.org/casp12/target.cgi?target=T0860&view=sequence

3)      http://www.predictioncenter.org/casp12/target.cgi?target=T0861&view=sequence

4)      http://www.predictioncenter.org/casp12/target.cgi?target=T0862&view=sequence

5)      http://www.predictioncenter.org/casp12/target.cgi?target=T0863&view=sequence

 

b.      Go to the NCBI BLAST web site and run Protein BLAST on each of the above sequences using different scoring matrices and word sizes.

 

c.       In your homework report, for each sequence please study the similarities and differences among the results returned by BLAST when the scoring matrix and word size are changed, and summarize your findings. In particular, please report how the E-value distribution changes with respect to the scoring matrix and word size.

 

3.      Homework 3:

a.       Please download the seed (multiple sequence) alignments of the top 20 Pfam families from the Pfam web site. The top 20 families are listed at http://pfam.xfam.org/family/browse?browse=top%20twenty

b.      Implement the simple algorithm covered in the lecture to build a Profile HMM from a multiple sequence alignment. You may use whatever programming language you are familiar with.

c.       Test your program on the multiple sequence alignments of the top 20 Pfam families and compare your results with what you can obtain by running HMMER3 (http://hmmer.org/). HMMER3 has a program hmmbuild that can build an HMM from a multiple sequence alignment. Please compare your HMM with the output of hmmbuild in terms of the number of match states, the state transition probabilities and the emission probabilities. Please try to explain any differences.

d.      Please hand in your source code (with some comments so that I can understand it) and your results.
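As a starting point for item b, here is a minimal sketch of one common textbook heuristic; the algorithm covered in the lecture may differ. It assumes the alignment has already been parsed into equal-length uppercase strings with '-' as the gap character, marks a column as a match state when fewer than half of its entries are gaps, and estimates match emission probabilities with a pseudocount of 1. The function names, the 0.5 threshold and the pseudocount are illustrative assumptions. Transition probabilities can be estimated in a similar way by counting match, insert and delete moves between consecutive match columns.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def match_columns(msa, max_gap_fraction=0.5, gap="-"):
    """Indices of alignment columns treated as match states.

    A column counts as a match column when its gap fraction is below
    max_gap_fraction (an assumed heuristic, not necessarily the lecture's).
    """
    n_seqs = len(msa)
    cols = []
    for c in range(len(msa[0])):
        gaps = sum(1 for row in msa if row[c] == gap)
        if gaps / n_seqs < max_gap_fraction:
            cols.append(c)
    return cols


def match_emissions(msa, alphabet=AMINO_ACIDS, pseudocount=1.0, gap="-"):
    """One emission distribution (dict residue -> probability) per match state."""
    emissions = []
    for c in match_columns(msa, gap=gap):
        # Count only residues in the alphabet; gaps and other symbols are skipped.
        counts = Counter(row[c] for row in msa if row[c] in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        emissions.append({a: (counts[a] + pseudocount) / total for a in alphabet})
    return emissions


# Toy alignment (made-up data): column 2 is half gaps, so it is not a match state.
toy_msa = ["AC-E", "AC-E", "A-GE", "ACGD"]
print(match_columns(toy_msa))  # [0, 1, 3]
print(len(match_emissions(toy_msa)))  # 3
```

hmmbuild applies sequence weighting and more elaborate priors, so its probabilities are not expected to match this simple estimate exactly.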

 

4.      Homework 4:

a.       In this homework, please learn to use two protein secondary structure prediction web servers: 1) SPIDER2: http://sparks-lab.org/server/SPIDER2/; and 2) RaptorX-Property: http://raptorx.uchicago.edu/StructurePropertyPred/predict/ .

b.      To test them, please download the sequence and structure files (in FASTA and PDB format respectively) of the following proteins from the Protein Data Bank website: 4ympA, 5a7dB, 5a7dL, 5aotA, 5ereA, 5fjlA, 5j4aA, 5j4aB, 5j5vA, 5j5vB, 5j5vC, 5jmbA, 5jmuA, 5kkpA and 5ko9A. In each identifier, the first 4 characters are the PDB ID and the last character is the protein chain name. For example, by searching for 4ymp at http://www.rcsb.org/pdb/home/home.do, you will be directed to the web page of the protein 4ymp. By clicking on the “Download” button, you will be able to download the FASTA sequence and the PDB files. Note that in this example you only need to download the sequence file for chain A.

c.       Please submit your sequences to the web servers and then calculate the secondary structure prediction accuracy (i.e., Q3 accuracy) of the returned results. To obtain the ground truth, you may use the DSSP web server or program at http://www.cmbi.ru.nl/dssp.html. You will need to write a small program to parse the output of DSSP.

d.      Compare your Q3 accuracy with that listed in Table 1 of the paper at https://academic.oup.com/bib/article/doi/10.1093/bib/bbw129/2769436/Sixty-five-years-of-the-long-march-in-protein to see how different your result is from this table. Only Q3 accuracy is needed.

e.       If you do not want to use the web servers, you can of course install their standalone packages. Generally speaking, however, it takes much more effort to install a local copy of the packages.
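The Q3 calculation in item c can be sketched as follows, assuming you have already extracted the per-residue DSSP states of one chain as a string and have a 3-state prediction (H, E, C) of the same length. The 8-to-3 state reduction used here (H, G, I to H; E, B to E; everything else to C) is one widely used convention and is an assumption; the servers and the paper in item d may use a different one, so please check before comparing numbers. The function names are hypothetical.

```python
# Assumed 8-to-3 state reduction; other reductions exist.
DSSP_TO_3 = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def reduce_to_3_state(dssp_states):
    """Map a string of DSSP states to 3 states (H, E, C).

    Any state not listed in DSSP_TO_3, including a blank, becomes C (coil).
    """
    return "".join(DSSP_TO_3.get(s, "C") for s in dssp_states)


def q3_accuracy(predicted, dssp_states):
    """Fraction of residues whose predicted 3-state label matches DSSP."""
    truth = reduce_to_3_state(dssp_states)
    if len(predicted) != len(truth):
        raise ValueError("prediction and DSSP string must have equal length")
    if not truth:
        return 0.0
    correct = sum(p == t for p, t in zip(predicted, truth))
    return correct / len(truth)


# Toy example (made-up strings): 4 of 6 residues agree.
print(q3_accuracy("HHHECC", "HGTEB "))
```

Residues that are missing from the structure have no DSSP state, so the sequence you submit to the servers may be longer than the DSSP output; the two need to be matched up by residue before calling a function like this.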

 

You can use existing libraries or Matlab or Python to implement your algorithm. However, please clearly point out your contribution in your report. If you use other bioinformatics libraries, please pay more attention to result analysis.

Research Projects

Please choose one of the following topics. You are also encouraged to propose your own topic. However, it is better not to work on the same topic for both your homework assignments and your final research project. For the algorithm development projects, you do not have to achieve state-of-the-art performance, but your performance should be reasonable. For review projects, in addition to critically reviewing existing work, please also discuss possible future research directions.

1.      Review algorithms for protein secondary structure prediction

2.      Review deep learning algorithms for protein local structure (secondary structure, solvent accessibility and torsion angles) prediction

3.      Review algorithms for protein contact prediction

4.      Review deep learning methods for drug discovery

5.      Review algorithms for protein structure alignment

6.      Review algorithms for phylogenetic tree construction

7.      Review algorithms for biological network alignment

8.      Review algorithms for biological network construction

 

9.      Develop algorithms for pairwise or multiple protein-protein interaction network alignment

10.  Develop deep learning algorithms for protein secondary structure and/or torsion angle prediction

11.  Develop deep learning algorithms for protein solvent accessibility and/or contact number prediction

12.  Develop deep learning algorithms to predict whether a protein is a DNA- or RNA-binding protein

13.  Develop deep learning algorithms to predict whether two proteins interact

 

The final project is due in early to mid-March. Please send me a brief abstract (one paragraph) describing what you want to work on before mid-February. If you need your final grade to graduate, please talk to me and hand in the final project earlier. If you need more time to complete the research project, please also talk to me.