The New World of Bioinformatics

Careers in Bioinformatics
Dr. Matthew Cserhati (UNMC)
Nebraska Wesleyan Phage Symposium, April 15, 2016

Personal introduction
  • MSc: biology, Eotvos Lorand University, Hungary
  • BSc: software engineering, University of Szeged, Hungary
  • PhD: biology, University of Szeged
  • Post-doc: University of Nebraska-Lincoln
  • University of Nebraska Medical Center, Durham Research Center 1: bioinformatics programmer
  • Email: [email protected]

Research responsibilities, projects
  • NeuroAIDS database development (XHTML, Java, JavaScript, MySQL; JBoss server; Linux environment)
  • Next Generation Sequencing data generation: demultiplexing (index-based read sequence generation), data transfer and storage
  • Differential gene expression analysis
  • Staphylococcus SNP detection and analysis
  • In silico assembly and annotation of giant virus genomes (in collaboration with Nebraska Wesleyan)

What is bioinformatics?
  • A science which deals with the production, analysis, modelling, depiction and storage of biological data
  • Biological data: sequence, gene expression value, 3D protein structure
  • Analysis can be done with an algorithm, a program/script, or a pipeline of different tools
  • Storage in databases for restricted/public use
  • Terms:
    In vitro: experimental system
    In vivo: living system
    In silico: analysis which is done in part or in whole using computational tools

An interdisciplinary science
Bioinformatics builds on:
  • Biology: uses and analyses data mainly from molecular biology
  • Computer science: programming, running programs, applications
  • Mathematics, statistics: evaluation of results and algorithm development

Some sub-disciplines within bioinformatics
  • Data storage and retrieval (databases)
  • Data analysis (genomics, proteomics, microarrays)
  • Data curation and annotation (prediction tools)
  • Structural bioinformatics (macromolecular 3D structures)

Data storage and retrieval

The NCBI (National Center for Biotechnology Information) database
  • Most widely known and used database in bioinformatics; contains millions of sequences
  • Also
    contains millions of published papers (PubMed, PMC), mainly biology papers
  • Can do complex queries with it
  • Sequence analysis tool (BLAST)
  • Gene Expression Omnibus (GEO)

NCBI stats (2016)
  • RefSeq (experimentally validated sequences): 58.5M protein sequences, 13.7M transcripts (mRNA), 60,000 species
  • Newly determined sequences are submitted to NCBI (GenBank) prior to publication

BLAST
  • Basic Local Alignment Search Tool
  • Basic function is to measure similarity between two sequences (nucleotide and/or protein):
    number and percentage of identical/similar base pairs or amino acids
    length of alignment
    E-value (the number of alignments at least this good expected by chance)
  • Otherwise used to compare a shorter query sequence with subject sequences in a database

MySQL
  • One of the most commonly used database systems, queried with SQL (Structured Query Language)
  • Database design, data storage, data query
  • Used from a command line, much like Linux
  • Data stored in databases, data tables, columns, and rows
  • A single database can have 20-1000 tables for one project

Other well-known databases
  • EBI: European Bioinformatics Institute
  • Swiss-Prot: protein database
  • EMBOSS: bioinformatics software
  • Transfac: regulatory motifs
  • PATRIC: pathogenic interactions db
  • UCSC Genome Browser
  • Ensembl: genetic data
  • JGI: curated db with genome, gene, protein sequences for different species
  • https://en.wikipedia.org/wiki/List_of_biological_databases

Dedicated databases
  • Data for one/few specific organisms
  • Experimental systems
  • TAIR: Arabidopsis genetics data
  • Xenbase: frog (X. laevis)
  • WormBase (C. elegans)
  • RGD: rat genome database
  • SGD: Saccharomyces genome db
  • FlyBase: D. melanogaster
  • SNiPHunter: SNP db (human)

[Figure: European Bioinformatics Institute (EBI); 4XT4]

Data analysis

Tools used in data analysis
  • For those with background in genomics, proteomics, microarrays
  • Operating system is usually Linux but also Windows
  • Linux is used for precise calculations and code
    development (Red Hat, CentOS)
  • Windows is used mainly for modelling

Languages used in bioinformatics
  • Data analysis languages: MATLAB, Perl, Python, C, R (statistical functions)
  • Modules: BioPerl, Biopython, Bioconductor
  • Database languages: PHP (Laravel), Java, JavaScript, jQuery (dynamic content)
  • Data storage languages: MySQL, NoSQL
  • Modelling software: Cytoscape, MATLAB

[Figure from paper, constructed in R]

Ribosomal protein networks
[Figures from presentation, constructed in Cytoscape]

Linux
  • Command-line operating system similar to DOS
  • Hierarchical folder system with permissions on files/directories
  • Useful for running programs and storing files in a systematic way
  • Not difficult to learn
  • A lot can be done with 50 commands
  • Many online guides

Data curation and annotation
  • Involves using algorithms to predict biological structures
  • E.g. functional annotation of genes in the virus genome project
  • Using CLC Genomics to predict ORFs in a de novo (unguided) assembled virus genome
  • Using BLAST to find homologous viral genes with the same function
  • Structural prediction programs to predict the 3D structure of proteins

Structural bioinformatics
  • Deals with the prediction of 3D structures of biological macromolecules (DNA, RNA, proteins)
  • Disciplines: biochemistry, biophysics
  • Useful databases: Molecular Modeling Database, Protein Data Bank, SCOP (Structural Classification Of Proteins)

SCOP 2
  • http://scop2.mrc-lmb.cam.ac.uk/front.html
  • Classifies proteins into folds, superfamilies, families
  • More detailed structures at lower levels of the hierarchy
  • E.g.
    b.1.12.1 - Purple acid phosphatase, N-terminal domain

EMBOSS programs for structural prediction
  • Nucleic 2D structure tool group
  • Protein 2D, 3D structure tool group
  • Nucleic RNA folding
  • Protein domains, functional sites, modifications

INBRE and the Guda lab at UNMC

Thematic areas of research in Guda lab

Institutional Development Award Program (IDeA) Networks of Biomedical Research Excellence (INBRE) program
  • $17.2 million National Institutes of Health grant for Nebraska biomedical research infrastructure
  • Provides research opportunities for undergraduate students
  • Pipeline for those students to continue into graduate research

INBRE Bioinformatics Core
  • Infrastructure development: research IT infrastructure (hardware, software, storage); bioinformatics infrastructure (computer servers, databases, software tools)
  • Services, data analysis and application development: an array of data analysis services; development of new methods to keep up with emerging technologies (metagenomics, single-cell NGS data analysis, etc.); software applications, web-based tools
  • Educational and training activities: multi-omics journal club; summer workshop on bioinformatics

List of publicly available bioinformatics programs on INBRE server
Affymetrix Annotation Converter, BLAST, BLAT, BRB-Array Tools, BioPerl, Bioconductor, Bowtie, Clustal2, Ensembl, Erlang, FASTX-Toolkit, Git, Glimmer, HMMER, I-TASSER, In-Silico PCR, MATLAB, MEME Suite, MaxQuant, Mfold, Microarray Analysis in R, Muscle, PHYLIP, PERL Modules, R, RiboSW, SQLite, Samtools, Weka

Survival analysis of TCGA Glioblastoma patients
  • Median: 345 days
  • Std dev: 201 days
  • Red: short-term survival group (median - 1 x std dev)
  • Green: long-term survival group (median + 1 x std dev)
  • Blue: intermediate
[Figure: survival curve; y-axis: proportion surviving (0 to 1); x-axis: days (0 to 4000)]

TCGA pancreatic cancer: 450K methylation data (n = 174 tumors, 10 normal)
  • Mishra and Guda (manuscript in preparation)
  • 300 hypermethylated probes, 200 hypomethylated
[Figure: hypermethylated and hypomethylated probes]

National NeuroAIDS Tissue Consortium Database
  • Cserhati et al., 2015

Assembly and annotation of large virus genomes
  • Ten giant virus genomes assembled de novo from read sequences (~330 kbp)
  • Paramecium bursaria Chlorella virus (PBCV)
  • ORF discovery resulting in several hundred candidate gene sequences per strain
  • ORF sequences searched with tblastx against known viral protein sequences
  • Many new genes with unknown functions
  • Giant viruses: a new domain of life
  • Possible functional annotation with 2D/3D EMBOSS programs

The latest technology in Next Generation Sequencing
  • Genome assembly of Neanderthal and Denisova in 2010
  • Low coverage (<5x)
  • Denisovan tooth from cave in Siberia
  • Nanopore technology
  • https://www.youtube.com/watch?v=3UHw22hBpAk

Summer Workshop on Bioinformatics
  • Workshop taught by Kiran Bastola ([email protected]) and Mark Pauley ([email protected]) at UNO

Workshop Format
  • Dates: July 2016
  • Four consecutive Fridays from 9am to noon
  • Taught at 276, PKI
  • Four modules, one on each day
  • Topics covered: Gquery, Entrez, biological database search, Vector NTI, Vector NTI/Ingenuity

Some useful links (hundreds of jobs)
  • http://www.jobs.com/q-bioinformatics-l-nebraska-jobs
  • http://www.iscb.org/iscb-careers-job-database (international level, good idea to be part of ISCB)
  • http://jobs.sciencecareers.org/jobs/bioinformatics/
  • http://jobs.newscientist.com/jobs/bioinformatics/ (international)
  • https://www.sciencemag.org/careers/features/2014/06/explosion-bioinformatics-careers (paper with tips on how to apply for bioinformatics jobs)

Acknowledgements
  • INBRE Bioinformatics Core Personnel: Babu Guda, PhD; Ashok Mudgapalli, PhD; Mike Gleason, PhD; Sanjit Pandey, MS
  • Support from: funding from INBRE; Jim Eudy, PhD, Genomics Core, UNMC; Dr. Jim Turpen, UNMC

Thanks for your attention!
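
Supplementary sketch: the survival grouping on the TCGA glioblastoma slide (median 345 days, standard deviation 201 days) can be illustrated in a few lines of Python. This is only a minimal sketch, not the analysis code behind the slide. It assumes "short-term" means survival below median - 1 x std dev and "long-term" means survival above median + 1 x std dev; the function name and the sample survival times are hypothetical.

```python
MEDIAN_DAYS = 345   # median survival reported on the slide
STD_DEV_DAYS = 201  # standard deviation reported on the slide


def survival_group(days, median=MEDIAN_DAYS, std_dev=STD_DEV_DAYS):
    """Assign a patient to a survival group from survival time in days.

    Assumption (not stated on the slide): values exactly on a cutoff
    are counted as intermediate.
    """
    lower = median - std_dev  # 345 - 201 = 144 days
    upper = median + std_dev  # 345 + 201 = 546 days
    if days < lower:
        return "short-term"    # red curve
    if days > upper:
        return "long-term"     # green curve
    return "intermediate"      # blue curve


if __name__ == "__main__":
    # Made-up survival times for illustration, not TCGA data.
    for days in (90, 345, 1200):
        print(days, survival_group(days))
```

Under these assumptions the cutoffs fall at 144 and 546 days, so most of the 0 to 4000 day axis belongs to the long-term group.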
