We will analyze two genomes, one from a prokaryote and one from a eukaryote: Escherichia coli ATCC 25922 and Candida albicans A123, respectively. We will use two standard gene-finding tools, Prodigal (https://github.com/hyattpd/Prodigal) and Augustus (https://github.com/Gaius-Augustus/Augustus).
Try to navigate the links and obtain the data yourselves (raw sequence, i.e. contigs/scaffolds, without annotation). In this example we will work with the NCBI database; an alternative that you might use in your project is the European Nucleotide Archive (https://www.ebi.ac.uk/ena). Try to download the data both to the server and to your own computer.
The link to the genome assembly:
https://www.ncbi.nlm.nih.gov/assembly/GCA_000401755.1/
Log in to the remote machine using ssh, then create and enter a directory for the exercise, e.g. gene_prediction:
mkdir gene_prediction; cd gene_prediction
Use wget or curl to download the data to the remote machine:
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/401/755/GCA_000401755.1_Escherichia_coli_ATCC_25922/GCA_000401755.1_Escherichia_coli_ATCC_25922_genomic.fna.gz
Alternatively, you can copy the file from your own computer. In Git Bash, Linux, or macOS, use the command:
scp yourfilename studentX@ip_address:/home/studentX/gene_prediction
Decompress the file:
gunzip GCA_000401755.1_Escherichia_coli_ATCC_25922_genomic.fna.gz
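Before predicting genes, it can help to check that the download and decompression produced a sensible FASTA file. The sketch below is one possible sanity check, not part of the original exercise: fasta_stats is a hypothetical helper name, and it simply counts header lines and sequence characters with awk.

```shell
# Hypothetical helper (not part of any tool used in this exercise):
# print the number of sequences and the total number of bases in a FASTA file.
fasta_stats() {
    awk '
        /^>/ { n++; next }                              # header line: one more sequence
        { gsub(/[ \t\r]/, ""); bases += length($0) }    # sequence line: add its length
        END { printf "sequences=%d bases=%d\n", n, bases }
    ' "$1"
}

# Run it on the E. coli assembly if the file is present in the current directory.
genome=GCA_000401755.1_Escherichia_coli_ATCC_25922_genomic.fna
if [ -f "$genome" ]; then
    fasta_stats "$genome"
fi
```

The number of sequences corresponds to the number of contigs/scaffolds in the assembly, and the number of bases should be in the range expected for an E. coli genome (a few million).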
Before proceeding, we will install the tools needed to predict protein-coding genes and tRNAs in our E. coli genome. We will use Prodigal to predict the open reading frames of protein-coding genes and tRNAscan-SE to predict tRNA genes.
First, create a new conda environment called gene_prediction: conda create --name gene_prediction
Activate the environment: conda activate gene_prediction
Install Prodigal: conda install -c bioconda prodigal
Install tRNAscan-SE: conda install -c bioconda trnascan-se
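Familiarize yourself with the different Prodigal options; more information: https://github.com/hyattpd/Prodigal. Find the protein-coding sequences (CDS) with the command below and familiarize yourself with the output files. Here -o writes the gene coordinates (in GFF format, selected with -f gff) and -a writes the protein translations: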
prodigal -i GCA_000401755.1_Escherichia_coli_ATCC_25922_genomic.fna -o GCA_000401755.1_Escherichia_coli_ATCC_25922_genomic.gff -a GCA_000401755.1_Escherichia_coli_ATCC_25922_genomic.fasta -f gff
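A quick way to get a feel for the Prodigal output is to count the predicted genes. The following is a minimal sketch, assuming the GFF file is tab-separated with the feature type in column 3 and comment lines starting with "#"; count_gff_features is a hypothetical helper name introduced here for illustration.

```shell
# Hypothetical helper: count features of a given type (column 3) in a GFF file,
# ignoring comment lines that start with "#".
count_gff_features() {
    awk -F'\t' -v type="$2" '!/^#/ && $3 == type { n++ } END { print n + 0 }' "$1"
}

# Compare the number of CDS features with the number of protein records,
# using the output file names from the prodigal command above.
gff=GCA_000401755.1_Escherichia_coli_ATCC_25922_genomic.gff
proteins=GCA_000401755.1_Escherichia_coli_ATCC_25922_genomic.fasta
if [ -f "$gff" ] && [ -f "$proteins" ]; then
    echo "CDS features in GFF: $(count_gff_features "$gff" CDS)"
    echo "protein records:     $(grep -c '^>' "$proteins")"
fi
```

The two numbers would normally be expected to agree, since each predicted CDS should have one translation in the protein file.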
Familiarize yourself with the tRNAscan-SE options (e.g. how to save the output in GFF format). To predict tRNA genes in a prokaryote (-B selects the bacterial search mode), run the following command:
tRNAscan-SE -B GCA_000401755.1_Escherichia_coli_ATCC_25922_genomic.fna -o GCA_000401755.1_Escherichia_coli_ATCC_25922_genomic.rna -a GCA_000401755.1_Escherichia_coli_ATCC_25922_genomic.rna.fasta
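To summarize the tRNA predictions, one option is to count the predicted tRNAs per amino-acid type. The sketch below assumes the default tabular output written with -o: three header lines followed by one row per tRNA, with the tRNA type in the fifth whitespace-separated column. Check this against your own file before relying on it; trna_type_counts is a hypothetical helper name.

```shell
# Hypothetical helper: count predicted tRNAs per type in tRNAscan-SE tabular output.
# Assumption: the first three lines are headers and column 5 holds the tRNA type.
trna_type_counts() {
    awk 'NR > 3 && NF >= 5 { count[$5]++ }
         END { for (t in count) printf "%s\t%d\n", t, count[t] }' "$1" |
        LC_ALL=C sort
}

# Run it on the output of the tRNAscan-SE command above, if present.
trna_table=GCA_000401755.1_Escherichia_coli_ATCC_25922_genomic.rna
if [ -f "$trna_table" ]; then
    trna_type_counts "$trna_table"
fi
```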
Try to navigate the NCBI FTP site yourself and find the GCA_000447455.1_Cand_albi_A123_V1 genome. Alternatively, download the genome to the remote computer by running the following command:
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/fungi/Candida_albicans/all_assembly_versions/GCA_000447455.1_Cand_albi_A123_V1/GCA_000447455.1_Cand_albi_A123_V1_genomic.fna.gz
Decompress the file: gunzip GCA_000447455.1_Cand_albi_A123_V1_genomic.fna.gz
For eukaryotes, gene prediction can take considerable time, so instead of running Augustus we will download a precomputed result. If you want to try running it yourself (takes ~10 min), uncomment the following command:
### augustus GCA_000447455.1_Cand_albi_A123_V1_genomic.fna --species=candida_albicans > GCA_000447455.1_Cand_albi_A123_V1_genomic.gff
Download the precomputed GFF file:
wget https://raw.githubusercontent.com/zelezniak-lab/MPBIO-BBT045/master/GCA_000447455.1_Cand_albi_A123_V1_genomic.gff
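Extract the protein sequences from the generated GFF file using the following command (getAnnoFasta.pl is a script distributed with Augustus; it writes the proteins to a file ending in .aa):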
getAnnoFasta.pl GCA_000447455.1_Cand_albi_A123_V1_genomic.gff
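Once the protein file exists, a simple check is to list the length of each predicted protein. This is a minimal sketch, assuming the proteins are in FASTA format in GCA_000447455.1_Cand_albi_A123_V1_genomic.aa (the file name used later in the BLAST step); protein_lengths is a hypothetical helper name.

```shell
# Hypothetical helper: print "<sequence id><TAB><length>" for every record
# in a (possibly multi-line) protein FASTA file.
protein_lengths() {
    awk '
        /^>/ {
            if (id != "") print id "\t" len   # flush the previous record
            id = substr($1, 2)                # id = header up to first space, without ">"
            len = 0
            next
        }
        { gsub(/[ \t\r*]/, ""); len += length($0) }   # ignore whitespace and stop symbols
        END { if (id != "") print id "\t" len }
    ' "$1"
}

# Show the five shortest predicted proteins, if the file is present.
aa=GCA_000447455.1_Cand_albi_A123_V1_genomic.aa
if [ -f "$aa" ]; then
    protein_lengths "$aa" | sort -k2,2n | head -n 5
fi
```

Very short predictions are worth a second look, since they are more likely to be spurious than long ones.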
As with the prokaryote, you can use tRNAscan-SE to find tRNAs in the Candida albicans genome. Please see the command help to choose the correct arguments.
First, download a database of protein sequences from UniProt (https://www.uniprot.org/downloads). Use the following commands to download the Swiss-Prot sequences to your working directory and decompress them:
wget https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
gunzip uniprot_sprot.fasta.gz
We will use blastp to search your predicted protein sequences against the database of UniProt sequences, but first we need to create a BLAST database:
makeblastdb -in uniprot_sprot.fasta -dbtype prot -out uniprot_database
The blastp search itself takes several hours to run, so the command is commented out here:
###blastp -query GCA_000447455.1_Cand_albi_A123_V1_genomic.aa -db uniprot_database -outfmt 7 -out blast_results
Instead, download a snapshot of the BLAST results file:
wget https://raw.githubusercontent.com/zelezniak-lab/MPBIO-BBT045/master/blast_results
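The results file produced with -outfmt 7 is a tab-separated table with comment lines starting with "#". With the default columns, the query ID is in column 1 and the bit score in column 12. A common first step is to keep only the best hit per query; the sketch below shows one way to do that under those assumptions. best_blast_hits and best_hits.tsv are hypothetical names introduced for illustration.

```shell
# Hypothetical helper: keep the highest-bit-score hit for each query.
# Assumption: default tabular columns (query in column 1, bit score in column 12).
best_blast_hits() {
    tab=$(printf '\t')
    grep -v '^#' "$1" |                                  # drop comment lines
        LC_ALL=C sort -t "$tab" -k1,1 -k12,12nr |        # group by query, best score first
        awk -F'\t' '!seen[$1]++'                         # keep the first row per query
}

# Apply it to the downloaded snapshot, if present.
if [ -f blast_results ]; then
    best_blast_hits blast_results > best_hits.tsv
    echo "queries with at least one hit: $(wc -l < best_hits.tsv)"
fi
```

Sorting by bit score rather than E-value avoids ties between hits whose E-values are all reported as 0.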
Chalmers University of Technology 2021