This tutorial is designed to get hands-on experience with two powerful sequence alignment tools: BLAST (Basic Local Alignment Search Tool) and HMMER. You will learn how to run these tools from the command line, understand their output, and see how to combine them for deeper sequence analysis.
Install BLAST and HMMER:
Please login to the CREATE note
srun -p cpu --time=03:00:00 --pty /bin/bash # use your assigned partition instead of cpuNavigate to your scratch space
/scratch/users/kYOURNUMBER/Using Conda to install BLAST and HMMER:
module load anaconda3/2022.10-gcc-13.2.0
conda create -n alignments
conda init bash
source ~/.bashrc
conda activate alignments
conda install -c bioconda blast hmmer seqtk mafft alvIf you don’t have conda, you need to first install conda. See instructions here
Prepare Your Workspace:
Create a new directory to store the files for this tutorial:
mkdir alignments
cd alignmentsBefore running a BLAST search, it’s important to have a database against which to compare your query sequence. In this section, we will build a local BLAST database using Uniref50 sequences, see Uniref documentation. The sequences were downloaded for you but you can access them here.
Obtain Sequences for the Database:
For this example, let’s look a FASTA file named
uniref50.fasta:
less /scratch/grp/msc_appbio/alignments/uniref50.fasta Q: How many sequences are in the database? Answer: 66527032
Building the BLAST Database:
The following the makeblastdb command creates a
BLAST database from the FASTA file. Please don’t run
it, it takes about 20 min to process all sequences but it will
probably kill the cluster’s filesystem if everyone will start running
the job as the job is highly depending on I/O operations.
#makeblastdb -in /scratch/grp/msc_appbio/alignments/uniref50.fasta -dbtype prot -out uniref50_dbOptions Explained:
-in /scratch/grp/msc_appbio/alignments/uniref50.fasta:
The input FASTA file.-dbtype prot: Specifies that the database is protein
sequences.-out uniref50_db: The name of the output database.To save time, I’ve already created database for you, instead copy it to your folder and unpack it:
cp /scratch/grp/msc_appbio/alignments/uniref50_db.tar .
tar xvf uniref50_db.tar BLAST is commonly used to identify sequences that are similar to a query sequence in a database. In this section, we will conduct a BLAST search using a protein query sequence.
Obtain a Query Sequence:
For this example, let’s use a sample FASTA file named
query.fasta. You can create it using:
nano query.fastaPaste the following sequence (this is PETase from Ideonella sakaiensis, Yoshida et al 2016, Science):
>A0A0K8P8E7
MQTTVTTMLLASVALAACAGGGSTPLPLPQQQPPQQEPPPPPVPLASRAACEALKDGNGDMVWPNAATVVEVAAW
RDAAPATASAAALPEHCEVSGAIAKRTGIDGYPYEIKFRLRMPAEWNGRFFMEGGSGTNGSLSAATGSIGGGQIA
SALSRNFATIATDGGHDNAVNDNPDALGTVAFGLDPQARLDMGYNSYDQVTQAGKAAVARFYGRAADKSYFIGCS
EGGREGMMLSQRFPSHYDGIVAGAPGYQLPKAGISGAWTTQSLAPAAVGLDAQGVPLINKSFSDADLHLLSQAIL
GTCDALDGLADGIVDNYRACQAAFDPATAANPANGQALQCVGAKTADCLSPVQVTAIKRAMAGPVNSAGTPLYNR
WAWDAGMSGLSGTTYNQGWRSWWLGSFNSSANNAQRVSGFSARSWLVDFATPPEPMPMTQVAARMMKFDFDIDPL
KIWATSGQFTQSSMDWHGATSTDLAAFRDRGGKMILYHGMSDAAFSALDTADYYERLGAAMPGAAGFARLFLVPG
MNHCSGGPGTDRFDMLTPLVAWVERGEAPDQISAWSGTPGYFGVAARTRPLCPYPQIARYKGSGDINTEANFACA
APPRunning a BLAST Search:
Use the BLASTP program to search against the custom database we just created:
blastp -query query.fasta -db uniref50_db -out blast_results.txt -evalue 1e-5 -outfmt 6Options Explained:
-query query.fasta: The query sequence file.-db uniref50_db: The uniref50_db database to search
against.-out blast_results.txt: Output file for the
results.-evalue 1e-5: Set the E-value threshold for reporting
matches.-outfmt 6: Output in tabular format (easy to
parse).Reviewing the Output:
Open the results file:
less blast_results.txtThe output format 6 is a tab-delimited table with the following fields: query ID, subject ID, % identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, E-value, and bit score.
Q: how many significant E-value < 1e-6 are out there? Answer: awk ‘$11 < 1e-6’ blast_results.txt | wc -l
Extract significant hits with E-value < 1e-6 a and identity >40%
awk '$3 > 40 && $11 < 1e-6' blast_results.txt > significant_blast_hits.txtHMMER is a tool used for searching sequence databases with profile hidden Markov models (HMMs). It is particularly useful for finding more remote homologs compared to BLAST.
Preparing a Multiple Sequence Alignment (MSA):
To create a profile HMM, you need an MSA file in FASTA format.
Let’s start first from extracting identified sequences using very useful
and lightweight seqtk utility for sequence manipulation.
You can find more info about seqtk here. First extract the hits
IDs.
cat significant_blast_hits.txt | cut -f2 > significant_blast_hits.ids
Then filter out from huge database file
seqtk subseq /scratch/grp/msc_appbio/alignments/uniref50.fasta significant_blast_hits.ids > blast_hits.fastaNow let’s aligned the extracted sequences, using MAFFT.
mafft blast_hits.fasta > hits_alignment.fastaYou can use to alignemnt viewer tools like Jalview to visualise the alignments.
We will use handy alv utility to
view alignments in terminal.
alv -k alignment.fasta | less -RBuilding a Profile HMM:
Once you have the MSA, use the following command to build a profile HMM:
hmmbuild profile.hmm alignment.fastaOptions Explained:
profile.hmm: The output HMM profile file.alignment.fasta: The input MSA file.Searching with HMMER:
Use hmmsearch to search for sequences that match
your profile HMM in a target database. In our case as a database we will
use metagenome sample collected as a part of Tara Ocean Project.
Let’s first access the sample data, stored at MGnify
database. We are looking at the study MGYS00002008, sample ERS494579,
assembly ERZ841232. To safe time, the file is already dowloaded for you
using the following command, Please don’t run:
#wget https://www.ebi.ac.uk/metagenomics/api/v1/analyses/MGYA00590504/file/ERZ841232_FASTA_predicted_cds.faa.gz
Instead let’s search the file directly using produced
profile.hmm
The command is:
hmmsearch --tblout hmm_results.txt profile.hmm /scratch/grp/msc_appbio/alignments/ERZ841232_FASTA_predicted_cds.faaOptions Explained:
--tblout hmm_results.txt: Save output in a tabular
format.profile.hmm: The HMM profile that you made.RZ841232_FASTA_predicted_cds.faa: The database file to
search.Reviewing the Output:
Open the results file:
cat hmm_results.txtThe output includes fields such as target name, accession, query name, E-value, and score.
In this tutorial, you learned how to use BLAST and HMMER from the command line for sequence similarity searches and how to combine the results to identify significant homologs. BLAST is fast and efficient for detecting close homologs, while HMMER excels at finding more distant relationships due to its use of profile HMMs.
Feel free to reach out with any questions or if you need further guidance on the exercises!