AGRP - Asteraceae Genomic Research Platform

Blast

Reference: Basic local alignment search tool

Download: ncbi-blast-2.13.0+-x64-linux.tar.gz

Installation:

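# Download via command line
# (the LATEST directory only holds the newest release, so a fixed version is fetched from its own version directory)
wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.13.0/ncbi-blast-2.13.0+-x64-linux.tar.gz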

# Unzip
tar -zxvf ncbi-blast-2.13.0+-x64-linux.tar.gz

# Rename
mv ncbi-blast-2.13.0+-x64-linux blast

# Environment variable settings
# Edit the ~/.bashrc file and add the following line at the end
# (replace /home/local/Software with the directory that contains the blast folder):
export PATH=/home/local/Software/blast/bin:$PATH

# Apply the configuration
source ~/.bashrc

Use Flow:

# 1. Makeblastdb (Format Database):
makeblastdb -in db.fasta -dbtype prot -parse_seqids -out dbname

# Parameter description:
# -in: the sequence file to be formatted
# -dbtype: database type, prot or nucl
# -out: database name
# -parse_seqids: parse sequence identifier (recommended to add)

# 2. Blastp (Protein vs Protein):
# -max_target_seqs limits the number of reported subjects; -num_descriptions only applies to -outfmt 0-4.
blastp -query seq.fasta -out seq.blast -db dbname -outfmt 6 -evalue 1e-5 -max_target_seqs 10 -num_threads 8

# 3. Blastn (Nucleotide vs Nucleotide; the database must be built with -dbtype nucl):
blastn -query seq.fasta -out seq.blast -db dbname -outfmt 6 -evalue 1e-5 -max_target_seqs 10 -num_threads 8

# 4. Blastx (Translated Nucleotide vs Protein):
blastx -query seq.fasta -out seq.blast -db dbname -outfmt 6 -evalue 1e-5 -max_target_seqs 10 -num_threads 8

# Output Format (-outfmt 6, the tabular format formerly called m8) Columns:
# 1. Query id        2. Subject id      3. Identity %
# 4. Alignment len   5. Mismatches      6. Gap openings
# 7. Q.start         8. Q.end           9. S.start
# 10. S.end          11. E-value        12. Bit score
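
The twelve columns above map directly onto a small record type. The following is a minimal Python sketch, not part of BLAST itself, that parses `-outfmt 6` lines and applies three thresholds of the kind used by the web form (E-value, bit score, identity). The names `Hit`, `parse_outfmt6`, and `filter_hits` are illustrative, the two example rows are made up, and treating the thresholds as inclusive is an assumption.

```python
from typing import Iterable, Iterator, NamedTuple


class Hit(NamedTuple):
    """One row of BLAST tabular output (-outfmt 6), in column order."""
    query: str
    subject: str
    identity: float   # percent identity
    length: int       # alignment length
    mismatches: int
    gap_opens: int
    q_start: int
    q_end: int
    s_start: int
    s_end: int
    evalue: float
    bit_score: float


def parse_outfmt6(lines: Iterable[str]) -> Iterator[Hit]:
    """Yield one Hit per non-empty, non-comment tab-separated line."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        yield Hit(f[0], f[1], float(f[2]), int(f[3]), int(f[4]), int(f[5]),
                  int(f[6]), int(f[7]), int(f[8]), int(f[9]),
                  float(f[10]), float(f[11]))


def filter_hits(hits, max_evalue=1e-5, min_score=100.0, min_identity=60.0):
    """Keep hits that pass all three thresholds (assumed inclusive)."""
    return [h for h in hits
            if h.evalue <= max_evalue
            and h.bit_score >= min_score
            and h.identity >= min_identity]


# Made-up example rows: the second one fails all three thresholds.
example = [
    "q1\ts1\t95.00\t200\t10\t0\t1\t200\t1\t200\t1e-100\t370",
    "q1\ts2\t45.00\t80\t44\t0\t5\t84\t9\t88\t2e-03\t38.5",
]
kept = filter_hits(parse_outfmt6(example))
print([h.subject for h in kept])
```

Reading a real result file works the same way: pass the open file object for `seq.blast` to `parse_outfmt6`.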

Notice:

# The database type and the search program must agree: build the database with -dbtype prot for blastp and blastx, and with -dbtype nucl for blastn.

Blast usage

1. Select the desired Blast subroutine. Choose between blastp (Protein vs Protein) or blastn (Nucleotide vs Nucleotide) from the dropdown menu.

2. Upload the Target Database file. Click to upload the library file you want to search against (must be in FASTA format).

3. Upload the Query Sequence file. Click to upload the specific sequence file you want to analyze (must be in FASTA format).

4. Enter the E-value threshold. Input the expectation value cutoff (default is 1e-5) to filter significant hits.

5. Input the Score Value. Enter the minimum alignment score required (default is 100).

6. Input the Identity Value. Enter the minimum percentage of identity required for a match (default is 60).

7. Click the "Run Blast" button. Submit the form to start the homology search and alignment process.

Diamond

Reference: Fast and sensitive protein alignment using DIAMOND

Download: diamond-linux64.tar.gz

Installation:

# Unpack
tar zxf diamond-linux64.tar.gz
# Move the binary into a personal bin directory
mkdir -p ~/bin
mv diamond ~/bin/
# Environment variable settings
echo 'export PATH=$PATH:$HOME/bin' >> ~/.bashrc
# Apply the configuration
source ~/.bashrc

Use Flow:

## Linux commands:
# Build a database (--in takes a protein FASTA file, here named nr)
diamond makedb --in nr --db nr
## Sequence alignment
# Nucleotide queries (translated search against the protein database)
diamond blastx --db nr -q reads.fna -o dna_matches_fmt6.txt
# Protein queries
diamond blastp --db nr -q reads.faa -o protein_matches_fmt6.txt

Notice:

# A DIAMOND database is always built from protein sequences. Choose the search mode by query type: diamond blastp for protein queries, diamond blastx for nucleotide queries.

Diamond usage

1. Select the desired Blast subroutine. Choose between blastp (Protein vs Protein) or blastx (Translated Nucleotide vs Protein) from the dropdown menu.

2. Upload the Target Database file. Click to upload the reference library file you want to search against (must be in FASTA format).

3. Upload the Pending Sequence file. Click to upload the query sequence file you want to analyze (must be in FASTA format).

4. Enter the E-value threshold. Input the expectation value cutoff (default is 1e-5) to filter significant hits.

5. Input the Score Value. Enter the minimum alignment score required (default is 100) to filter results.

6. Input the Identity Value. Enter the minimum percentage of identity required for a match (default is 60).

7. Click the "Run Diamond" button. Submit the form to start the high-performance homology alignment process.

MCScanX

Reference: MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity

Download: MCScanX-master.zip

Installation:

# Unzip:
unzip MCScanX-master.zip

# Compile (run make inside the unpacked source directory)
cd MCScanX-master
make

Use Flow:

## Preparation:
# 1. Put the gff file and blast file into the same folder.
# 2. The gff file should be the merged result of two species' gff files.
# 3. The blast file and gff file must share the same prefix name (e.g., se_so.gff and se_so.blast).

# Run MCScanX
MCScanX se_so

# Plot collinear gene pairs (Dot Plot)
java dot_plotter -g se_so.gff -s se_so.collinearity -c dot.ctl -o dot.PNG
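
A frequent cause of empty MCScanX results is a mismatch between the gene IDs in the blast file and those in the merged gff file. The following Python sketch checks this before running MCScanX. It assumes the simplified four-column gff layout (chromosome, gene ID, start, end) and tab-separated BLAST output; the function names and the example IDs are hypothetical.

```python
def genes_in_gff(gff_lines):
    """Collect gene IDs from gff lines of the form: chromosome, gene, start, end."""
    genes = set()
    for line in gff_lines:
        fields = line.split()
        if len(fields) >= 4:
            genes.add(fields[1])
    return genes


def missing_genes(blast_lines, gff_lines):
    """Return the gene IDs used in the BLAST table but absent from the gff."""
    known = genes_in_gff(gff_lines)
    missing = set()
    for line in blast_lines:
        fields = line.split("\t")
        if len(fields) >= 2:
            # Columns 1 and 2 hold the query and subject gene IDs.
            missing.update(g for g in fields[:2] if g not in known)
    return sorted(missing)


# Made-up example: the second BLAST row refers to a gene without coordinates.
gff = ["se1\tSe01g001\t100\t900", "so1\tSo01g001\t50\t700"]
blast = [
    "Se01g001\tSo01g001\t88.0\t250\t30\t0\t1\t250\t1\t250\t1e-90\t320",
    "Se01g001\tSo01g999\t70.0\t100\t30\t0\t1\t100\t1\t100\t1e-20\t90",
]
print(missing_genes(blast, gff))
```

An empty list means every gene in the blast file has coordinates in the gff file.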

Common compile errors:

# Error 1: "msa.cc:289:9: error: ‘chdir’ was not declared in this scope"
# Fix 1: Open "msa.cc" and add #include <unistd.h> at the top.

# Error 2: "dissect_multiple_alignment.cc:252:44: error: ‘getopt’ was not declared in this scope"
# Fix 2: Open "dissect_multiple_alignment.cc" and add #include <getopt.h> at the top.

# Error 3: "detect_collinear_tandem_arrays.cc:286:17: error: ‘getopt’ was not declared in this scope"
# Fix 3: Open "detect_collinear_tandem_arrays.cc" and add #include <getopt.h> at the top.

MCScanX usage

1. Upload the BLAST File. Click to upload the pairwise alignment file (usually generated by BLASTP) which contains the homologous gene pairs.

2. Upload the GFF File. Click to upload the General Feature Format file that defines the physical positions of the genes on the chromosomes.

3. Input BLAST Information (Optional). Alternatively, if you do not have a file, paste the raw BLAST result data directly into this text area.

4. Input GFF Information (Optional). Alternatively, if you do not have a file, paste the raw GFF coordinate data directly into this text area.

5. Click the "Submit Analysis" button. Submit the form to run the MCScanX algorithm for synteny inference and evolutionary analysis.

ColinearScan

Reference: Statistical inference of chromosomal homology based on gene colinearity and applications to Arabidopsis and rice

Use Flow (Step-by-Step):

# 1. Extracting gene pairs from BLAST results
cat ath_chr2_indica_chr5.blast | get_pairs.pl --score 100 > ath_chr2_indica_chr5.pairs

# 2. Masking of highly repetitive loci
cat ath_chr2_indica_chr5.pairs | repeat_mask.pl -n 5 > ath_chr2_indica_chr5.purged

# 3. Estimate maximum gap length
max_gap.pl --lenfile ath_chrs.lens --lenfile indica_chrs.lens --suffix purged

# 4. Detect colinear blocks (the --mg values are the maximum gap lengths estimated in step 3)
block_scan.pl --mg 321000 --mg 507000 --lenfile ath_chrs.lens --lenfile indica_chrs.lens --suffix purged

Shell Script Example:

For efficiency, the above process can be automated using the following script:

#!/bin/sh
do_error()
{
    echo "Error occurred when running $1"
    exit 1
}

echo "Start to run the working example..."
echo

echo "* STEP1 Extract pairs from BLAST results"
echo "  We should parse BLAST results and extract pairs of anchors (genes in this example) satisfying our rule (score >= 100)."
cat ath_chr2_indica_chr5.blast | get_pairs.pl --score 100 > ath_chr2_indica_chr5.pairs || do_error get_pairs.pl
echo

echo "* STEP2 Mask highly repeated anchor"
echo "  Highly repeated anchors, which are mostly generated by continuous single gene duplication events, make colinear segments hard to detect. We mask them off using a very simple algorithm."
cat ath_chr2_indica_chr5.pairs | repeat_mask.pl -n 5 > ath_chr2_indica_chr5.purged || do_error repeat_mask.pl
echo

echo "* STEP3 Estimate maximum gap length"
echo "  Use pair files with repeats masked to estimate mg values which will be used to detect colinear blocks."
max_gap.pl --lenfile ath_chrs.lens --lenfile indica_chrs.lens --suffix purged || do_error max_gap.pl
echo

echo "* STEP4 Detect blocks from pair file(s)"
echo "  Everything is ready; run the scan at last."
block_scan.pl --mg 321000 --mg 507000 --lenfile ath_chrs.lens --lenfile indica_chrs.lens --suffix purged || do_error block_scan.pl
echo

echo "Now ath_chr2_indica_chr5.blocks contains predicted colinear blocks."

ColinearScan usage

1. Upload the BLAST File. Click to upload the pairwise sequence alignment file (generated by BLAST) containing the homologous gene pairs.

2. Upload the GFF1 File. Click to upload the General Feature Format file describing the gene positions for the first genome (Reference/Query).

3. Upload the GFF2 File. Click to upload the General Feature Format file describing the gene positions for the second genome (Target/Subject).

4. Paste BLAST Information (Optional). If you don't have a file, paste the raw BLAST alignment text data directly into this area.

5. Paste GFF1 Information (Optional). If you don't have a file, paste the raw GFF coordinate data for the first genome into this area.

6. Paste GFF2 Information (Optional). If you don't have a file, paste the raw GFF coordinate data for the second genome into this area.

7. Enter the E-value threshold. Input the expectation value cutoff (default is 1e-5) to filter out insignificant alignment hits.

8. Input the Score Value. Enter the minimum alignment score required (default is 0) to accept a match.

9. Input the Hit Number. Enter the maximum number of top hits to consider for analysis (default is 30).

10. Select Position or Order. Choose "pos" to use physical chromosomal positions (bp) or "order" to use gene rank order for collinearity calculations.

11. Select Is CDSVCHR. Choose "is" or "no" to determine if the software should strictly match Coding Sequence (CDS) IDs to Chromosome IDs.

12. Click the "Run Analysis" button. Submit the form to execute ColinearScan and identify conserved gene blocks.
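
The difference between "pos" and "order" in step 10 can be illustrated with a short Python sketch that turns physical start positions into per-chromosome gene ranks. How the platform derives gene order internally is not documented here, so the function name, the input layout, and the example genes are all assumptions.

```python
from collections import defaultdict


def positions_to_order(genes):
    """Map each gene ID to its 1-based rank along its chromosome.

    `genes` is an iterable of (gene_id, chromosome, start_bp) tuples.
    """
    by_chromosome = defaultdict(list)
    for gene_id, chromosome, start in genes:
        by_chromosome[chromosome].append((start, gene_id))

    order = {}
    for items in by_chromosome.values():
        # Sort by start position, then number the genes consecutively.
        for rank, (_, gene_id) in enumerate(sorted(items), start=1):
            order[gene_id] = rank
    return order


# Made-up coordinates: three genes on chr1 and one on chr2.
genes = [
    ("g3", "chr1", 9000),
    ("g1", "chr1", 1200),
    ("g2", "chr1", 4500),
    ("h1", "chr2", 300),
]
order = positions_to_order(genes)
print(order)
```

With "order", two neighbouring genes are always one unit apart regardless of the intergenic distance, which makes gap thresholds comparable between gene-dense and gene-poor regions.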

ParaAT

Reference: ParaAT: A parallel tool for constructing multiple protein-coding DNA alignments, Biochem Biophys Res Commun

Download: ParaAT2.0.tar.gz

# Official download address:
https://ngdc.cncb.ac.cn/tools/paraat

Use Flow:

# "ParaAT.pl" is the running script. You can use it directly after downloading and unpacking. 
# Either add the unpacked path to your environment variable or use the absolute path.

# Dependency Tools Required:
# 1. Protein alignment tools (install at least one): clustalw2, mafft, muscle, etc.
# 2. KaKs_Calculator (https://ngdc.cncb.ac.cn/tools/kaks)

# Run ParaAT Command:
ParaAT.pl -h test.homologs -n test.cds -a test.pep -p proc -m muscle -f axt -g -k -o result_dir

# Parameter Explanation:
# -h: Homologs file
# -n: CDS file
# -a: Protein (pep) file
# -p: File containing the number of processors (threads) to use
# -m: Aligner to use (e.g., muscle)
# -f: Output format (e.g., axt)
# -g: Remove codons with gaps
# -k: Calculate Ka/Ks (calls KaKs_Calculator)
# -o: Output directory

ParaAT usage

1. Select the Result Output Format. Choose the desired file format for the alignment output (e.g., axt, fasta, paml, codon, or clustal) from the dropdown menu.

2. Upload the Homologs file. Click to upload the text file containing the list of homologous gene pairs or groups to be analyzed.

3. Upload the CDS Sequence file. Click to upload the Coding DNA Sequence (CDS) file corresponding to the gene list (must be in FASTA format).

4. Upload the Protein Sequence file. Click to upload the Peptide/Protein (PEP) sequence file corresponding to the gene list (must be in FASTA format).

5. Click the "Run Analysis" button. Submit the form to start the parallel alignment and back-translation process.

KaKs_Calculator

Reference: KaKs_Calculator 3.0: Calculating Selective Pressure on Coding and Non-coding Sequences

Download: KaKs_Calculator3.0.zip

# Official download address:
https://ngdc.cncb.ac.cn/biocode/tools/BT000001

Installation:

# Unzip
unzip KaKs_Calculator3.0.zip

# Compile KaKs
cd KaKs_Calculator3.0 && make
# Main programs generated: KaKs, KnKs, AXTConvertor

# Environment setup
# Add the path to your environment variables.
# Note: This tool often works in conjunction with ParaAT.
# See the ParaAT section of this document for its installation.

Use Flow (via ParaAT):

# 1. Prepare input files
# test.homologs: names of the homologous genes to compare
# test.cds: DNA sequence of each gene
# test.pep: Protein sequences for each gene
# proc: A file containing a single number, the number of CPU threads to use (e.g., just write '8' in it)

# 2. Start analysis (Calling KaKs_Calculator via ParaAT)
ParaAT.pl -h test.homologs -n test.cds -a test.pep -p proc -m mafft -f axt -g -k -o result_dir

# Parameter explanation:
# -h : homologous gene name file
# -n : file of specified nucleic acid sequences (CDS)
# -a : specified protein sequence file (PEP)
# -p : specifies the file containing thread count
# -m : specifies the alignment tool (clustalw2 | t_coffee | mafft | muscle)
# -g : remove codons with gaps
# -k : use KaKs_Calculator to calculate Ka/Ks values
# -o : output directory
# -f : format of the output alignment file (AXT is standard for KaKs_Calculator)
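
The small input files for a ParaAT run can be generated programmatically. The Python sketch below writes a homologs file and the proc file into a working directory and assembles the command as an argument list without executing it. The one-group-per-line, tab-separated homologs layout is an assumption, and the function name, file names, and gene IDs are placeholders.

```python
import tempfile
from pathlib import Path


def prepare_paraat_run(workdir, pairs, threads=8, aligner="mafft"):
    """Write the homologs and proc files, then return the ParaAT command list."""
    workdir = Path(workdir)
    # Assumed layout: one homologous group per line, IDs separated by tabs.
    (workdir / "test.homologs").write_text(
        "".join("\t".join(pair) + "\n" for pair in pairs))
    # The proc file holds nothing but the number of threads.
    (workdir / "proc").write_text(f"{threads}\n")
    return ["ParaAT.pl", "-h", "test.homologs", "-n", "test.cds",
            "-a", "test.pep", "-p", "proc", "-m", aligner,
            "-f", "axt", "-g", "-k", "-o", "result_dir"]


with tempfile.TemporaryDirectory() as tmp:
    cmd = prepare_paraat_run(tmp, [("geneA1", "geneB1"), ("geneA2", "geneB2")])
    proc_text = (Path(tmp) / "proc").read_text()
    homologs_text = (Path(tmp) / "test.homologs").read_text()

print(" ".join(cmd))
```

The returned list can be handed to `subprocess.run(cmd, cwd=workdir, check=True)` once test.cds and test.pep are in place and ParaAT.pl is on the PATH.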

KaKs_Calculator usage

1. Select the Estimate Ka and Ks Method. Choose the desired calculation algorithm (e.g., NG, YN, GY, MA) from the dropdown menu to determine how substitution rates are estimated.

2. Upload the KaKs Source File. Click to upload the sequence alignment file you want to analyze (must be in AXT format as shown in the example).

3. Click the "Calculate" button. Submit the form to initiate the calculation of non-synonymous (Ka) and synonymous (Ks) substitution rates.
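
For users who build the AXT input by hand, the following Python sketch formats one aligned pair of coding sequences as an AXT record: a name line, the two aligned sequences, and a blank separator line. The function name and the example sequences are hypothetical, and the length checks are a precaution added here rather than a documented requirement of the platform.

```python
def axt_record(name, seq1, seq2):
    """Format one aligned sequence pair as an AXT record."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    if len(seq1) % 3 != 0:
        raise ValueError("codon alignment length must be a multiple of 3")
    # Name line, two sequence lines, then an empty line closing the record.
    return f"{name}\n{seq1}\n{seq2}\n\n"


record = axt_record("geneA1-geneB1", "ATGGCTAAA", "ATGGCAAAG")
print(record, end="")
```

Several records can simply be concatenated into one file.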

HMMER

Description:

HMMER searches sequence databases for homologs and builds sequence alignments. Its methods are based on probabilistic models called profile hidden Markov models (profile HMMs).

Use Flow (hmmbuild):

# 1. Basic Usage:
# hmmbuild builds a profile HMM from a multiple sequence alignment (MSA).
hmmbuild [-options] <hmmfile_out> <msafile>

# 2. Input/Output Description:
# <msafile> : The input file must be a multiple sequence alignment.
#             Supported formats include Stockholm, aligned FASTA, CLUSTAL, PHYLIP, and SELEX.
# <hmmfile_out> : The output profile HMM file (usually .hmm extension).

# 3. Common Options:
# hmmbuild usually automatically detects input type (DNA/Protein).
# You can force the sequence type:
# --amino : Force input to be interpreted as protein sequences.
# --dna   : Force input to be interpreted as DNA sequences.
# --rna   : Force input to be interpreted as RNA sequences.

Hmmer usage

1. Upload the HMM Model Comparison File. Click to upload the Hidden Markov Model file (usually with a .hmm extension) representing the protein family or domain profile you wish to use as a query.

2. Upload the Sequence Database File. Click to upload the protein sequence library file you want to search against (must be in FASTA format).

3. Enter the E-value Threshold. Input the expectation value cutoff (default is 1e-5) to filter out statistically insignificant matches and control the sensitivity of the search.

4. Click the "Run Hmmer" button. Submit the form to initiate the profile-based homology search.

Pfam

Reference: Pfam: the protein families database

Download & Preparation:

# 1. Download Database Files (EBI FTP)
wget https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz
wget https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.dat.gz
wget https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.seed.gz
wget https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.full.gz

# 2. Unzip (pfam_scan.pl needs both Pfam-A.hmm and Pfam-A.hmm.dat)
gunzip Pfam-A.hmm.gz Pfam-A.hmm.dat.gz

# 3. Format the Pfam database (using hmmpress from HMMER package)
hmmpress Pfam-A.hmm

Use Flow:

# Run the program (Example)
nohup pfam_scan.pl \
  -fasta /your_path/masp.protein.fasta \
  -dir /your_path/PfamScan/Pfam_data \
  -outfile masp_pfam \
  -cpu 16 &

# Result Analysis (selected fields of the pfam_scan.pl output, which has 15 columns in total):
# seq id      : ID of the query sequence (sequences without any Pfam hit do not appear in the output)
# hmm acc     : Accession of the Pfam domain
# hmm name    : Name of the Pfam domain
# hmm start   : First position of the match on the profile HMM
# hmm end     : Last position of the match on the profile HMM
# hmm length  : Length of the Pfam domain model
# bit score   : The score of the alignment
# E-value     : Significance. (Filter condition usually: E-value < 0.001)
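
The E-value filter and the domain-keyword filter can be sketched in a few lines of Python. The sketch assumes the standard whitespace-separated pfam_scan.pl table with 15 columns (seq id, alignment start and end, envelope start and end, hmm acc, hmm name, type, hmm start, hmm end, hmm length, bit score, E-value, significance, clan). The function name is illustrative and the three example rows are made up.

```python
def parse_pfam_scan(lines, max_evalue=1e-3, keywords=""):
    """Return (seq_id, hmm_acc, hmm_name, evalue) tuples that pass the filters.

    `keywords` is a semicolon-separated list of domain names; an empty string
    keeps every domain.
    """
    wanted = {k.strip() for k in keywords.split(";") if k.strip()}
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.split()
        if len(f) < 13:
            continue  # not a complete result row
        seq_id, hmm_acc, hmm_name, evalue = f[0], f[5], f[6], float(f[12])
        if evalue < max_evalue and (not wanted or hmm_name in wanted):
            hits.append((seq_id, hmm_acc, hmm_name, evalue))
    return hits


# Made-up rows in the assumed 15-column layout.
example = [
    "# made-up example rows",
    "AT1G24260.1  11  58  9  59  PF00319.21  SRF-TF  Domain  3  50  51  60.2  1.1e-16  1  No_clan",
    "AT1G24260.1  84  171  75  173  PF01486.20  K-box  Family  9  98  100  85.3  3.2e-24  1  No_clan",
    "AT5G00001.1  5  40  3  45  PF00069.28  Pkinase  Domain  1  40  264  12.0  0.5  0  CL0016",
]
mads_hits = parse_pfam_scan(example, keywords="SRF-TF;K-box")
print([name for _, _, name, _ in mads_hits])
```

Sequences that carry every requested domain can then be found by grouping the returned tuples by `seq_id`.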

Pfam usage

1. Upload the Protein File (PEP). Click to upload the protein sequence file you want to analyze for domain prediction (must be in FASTA format).

2. Enter Conserved Domain Keywords. Input the specific structural domain names you wish to filter or extract (e.g., SRF-TF;K-box), separating multiple keywords with semicolons.

3. Click the "Start Analysis" button. Submit the form to execute pfam_scan.pl and retrieve the domain prediction results.

MEME

Reference: MEME SUITE: tools for motif discovery and searching

Download: meme-5.5.4.tar.gz

Official Releases: Download - MEME Suite

Installation:

# Prerequisites: Perl version 5.10.1+ is required.
# If you need to install Perl manually:
tar zxvf perl.tar.gz
cd /yourpath/perl
./Configure -des -Dprefix=/yourpath/perl -Dusethreads
make && make test && make install
# Add to PATH: export PATH=/yourpath/perl/bin:$PATH

# Installing MEME:
tar zxf meme-5.5.4.tar.gz
cd meme-5.5.4
./configure --prefix=/yourpath/meme --with-url=http://meme-suite.org --enable-build-libxml2 --enable-build-libxslt
make
make test
make install

Use Flow:

Detailed documentation is available in the MEME Manual.

MEME usage

1. Upload the Target Database file. Click to upload the sequence file containing the data in which you want to discover motifs (must be in FASTA format).

2. Select the Sequence Type. Choose the biological type of your input sequences (Protein, DNA, or RNA) from the dropdown menu.

3. Select the Distribution Pattern. Choose the expected occurrence of the motif per sequence: "Zero or One Occurrence" (ZOOPS), "One Occurrence" (OOPS), or "Any Number of Repetitions" (ANR).

4. Input the Number of Motifs. Enter the maximum number of distinct motifs you want the software to identify (default is 3).

5. Input the Minimum Width. Enter the minimum length (in residues or nucleotides) allowed for a single motif (default is 2).

6. Input the Maximum Width. Enter the maximum length (in residues or nucleotides) allowed for a single motif (default is 10).

7. Click the "Find Motifs" button. Submit the form to run the MEME algorithm and start the motif discovery process.
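
The form fields above correspond closely to options of the `meme` command-line program. The Python sketch below assembles such a command as an argument list without running it. The mapping from web form to command line is an assumption about how the platform calls MEME, and the function name, file name, and output directory are placeholders.

```python
ALPHABET_FLAGS = {"protein": "-protein", "dna": "-dna", "rna": "-rna"}
SITE_MODELS = {"zoops", "oops", "anr"}


def meme_command(fasta, seq_type="protein", model="zoops",
                 nmotifs=3, minw=2, maxw=10, outdir="meme_out"):
    """Build a meme argument list from the web-form style settings."""
    if seq_type not in ALPHABET_FLAGS:
        raise ValueError(f"unknown sequence type: {seq_type}")
    if model not in SITE_MODELS:
        raise ValueError(f"unknown distribution pattern: {model}")
    if minw > maxw:
        raise ValueError("minimum width must not exceed maximum width")
    return ["meme", fasta, ALPHABET_FLAGS[seq_type],
            "-mod", model,             # motif distribution per sequence
            "-nmotifs", str(nmotifs),  # maximum number of motifs
            "-minw", str(minw),
            "-maxw", str(maxw),
            "-oc", outdir]             # output directory, overwritten if present


cmd = meme_command("seqs.fasta")
print(" ".join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` on a machine where MEME is installed.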

CpgFinder

Reference: CPG Island Finder with Sliding Window Algorithm

Program Options:

# The program is intended to search for CpG islands in sequences.

# 1. Min length of island to find:
# Searching CpG islands with a length (bp) not less than specified.

# 2. Min percent G and C:
# Searching CpG islands with a composition not less than specified.

# 3. Min CpG number:
# The minimal number of CpG dinucleotides in the island.

# 4. Min gc_ratio [P(CpG) / expected P(CpG)]:
# The minimal ratio of the observed to expected frequency of CpG dinucleotide in the island.

# 5. Extend island:
# Extending the CpG island if its length is shorter than required.

Output Example:

Search parameters:  len: 200   %GC: 50.0   CpG number: 0   P(CpG)/exp: 0.600   extend island: no   A: 21   B: -2
Locus name:  9003..16734 note="CpG_island (%GC=65.4, o/e=0.70, #CpGs=577)"
Locus reference:   expected P(CpG): 0.086   length: 25020
    20.1%(a)  29.9%(c)  28.6%(g)  21.4%(t)   0.0%(other)

                FOUND 4 ISLANDS
  #     start      end   chain   CpG    %CG    CG/GC    P(CpG)/exp     P(CpG)    len
  1      9192    10496     +     161   73.0    0.847   0.927( 1.44)    0.123    1305
  2     11147    11939     +      87   69.2    0.821   0.917( 1.28)    0.110     793
  3     15957    16374     +      57   79.4    0.781   0.871( 1.60)    0.137     418
  4     14689    15091     +      49   74.2    0.817   0.887( 1.42)    0.122     403
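
The per-island statistics in this table can be reproduced with a few counts. In the output above, P(CpG) for island 1 equals 161 / 1305 ≈ 0.123, and dividing it by the product of the C and G frequencies gives the P(CpG)/exp column. The Python sketch below computes these values for a single sequence window. It illustrates the statistics only and is not the program's sliding-window search; the function name and the toy sequence are hypothetical.

```python
def cpg_stats(seq):
    """Return GC percentage, P(CpG), expected P(CpG) and their ratio for a window."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")            # CpG dinucleotides cannot overlap each other
    p_cpg = cpg / n                  # observed CpG frequency
    expected = (c / n) * (g / n)     # expected frequency if C and G were independent
    ratio = p_cpg / expected if expected > 0 else 0.0
    return {
        "gc_percent": 100.0 * (c + g) / n,
        "cpg_count": cpg,
        "p_cpg": p_cpg,
        "expected_p_cpg": expected,
        "ratio": ratio,
    }


def is_island(seq, min_len=200, min_gc=50.0, min_ratio=0.6, min_cpg=0):
    """Apply the four thresholds from the program options to one window."""
    stats = cpg_stats(seq)
    return (len(seq) >= min_len and stats["gc_percent"] >= min_gc
            and stats["ratio"] >= min_ratio and stats["cpg_count"] >= min_cpg)


stats = cpg_stats("CGCGATAT")
print(stats)
```

The default arguments of `is_island` mirror the search parameters printed in the first line of the output example (len 200, %GC 50.0, P(CpG)/exp 0.600, CpG number 0).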

CpgFinder usage

1. Upload the Nucleic Acid Sequence file. Click to upload the DNA sequence file you want to analyze for CpG island prediction (e.g., in FASTA or plain text format).

2. Input the Minimum Island Length. Enter the minimum length in base pairs required for a region to be defined as a CpG island (default is 200 bp).

3. Input the Minimum GC Percentage. Enter the minimum percentage of Guanine and Cytosine content required within the sliding window (default is 50%).

4. Input the Minimum CpG Ratio. Enter the minimum threshold for the observed-to-expected CpG dinucleotide ratio (default is 0.6).

5. Click the "Find CpG Islands" button. Submit the form to execute the algorithm and identify genomic regions matching the specified criteria.

CodonW

Reference: Codon Pattern of Papillomavirus (Type I) from Bos Grunniens Based on the CodonW Software

Download: CodonWSourceCode_1_4_4.zip

Installation (Linux/Conda):

# Install directly with conda (the package is distributed through the bioconda channel):
conda install -c bioconda codonw

# Run:
codonw

# Expected Initial Menu Output:
# Welcome to CodonW  for Help type h
# Initial Menu
# Option
# (1) Load sequence file
# (3) Change defaults
# (4) Codon usage indices
# (5) Correspondence analysis
# (7) Teach yourself codon usage
# (8) Change the output written to file
# (9) About C-codons
# (R) Run C-codons
# (Q) Quit

Available Indices (Menu Option 4):

 Codon usage indices Options:
 ( 1) {Codon Adaptation Index       (CAI)          }
 ( 2) {Frequency of OPtimal codons  (Fop)          }
 ( 3) {Codon bias index             (CBI)          }
 ( 4) {Effective Number of Codons   (ENc)          }
 ( 5) {GC content of gene           (G+C)          }
 ( 6) {GC of silent 3rd codon posit.(GC3s)         }
 ( 7) {Silent base composition                     }
 ( 8) {Number of synonymous codons  (L_sym)        }
 ( 9) {Total number of amino acids  (L_aa )        }
 (10) {Hydrophobicity of protein    (Hydro)        }
 (11) {Aromaticity of protein       (Aromo)        }
 (12)  Select all
 (X)   Return to previous menu

CodonW usage

1. Upload the Sequence File. Click to upload the specific sequence file you want to analyze (must be in FASTA format).

2. Click the "Start Analysis" button. Submit the form to start the CodonW codon usage analysis.

IQ-Tree

Reference: IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies

Installation:

conda install -c bioconda iqtree

Use Flow:

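# The -T and -B options below use IQ-TREE 2 syntax.

# 1. Model selection only (ModelFinder with a full tree search per model; no final tree is inferred):
iqtree -s example.phy -m MF -mtree -T AUTO

# 2. Specify a substitution model (e.g., GTR+I+G):
iqtree -s example.phy -m GTR+I+G

# 3. ModelFinder + Tree + UltraFast Bootstrap (1000 replicates):
# (uppercase -B requests ultrafast bootstrap; lowercase -b would run the much slower standard bootstrap)
iqtree -s example.phy -m MFP -B 1000 -T AUTO

# 4. ModelFinder + Tree + UltraFast Bootstrap + BN-NNI search:
iqtree -s example.phy -m MFP -B 1000 --bnni -T AUTO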

IQ-Tree usage

1. Upload the Alignment File. Click to upload your Multiple Sequence Alignment (MSA) file. Note that this must be an aligned sequence file (DNA or Protein), not raw sequences.

2. Select the Alignment Format. Choose the specific format of your uploaded file (FASTA, PHYLIP, or NEXUS) from the dropdown menu to ensure correct parsing.

3. Select the Substitution Model. Choose a specific evolutionary substitution model (e.g., GTR+G for DNA, LG+G for Protein) or keep the default "Auto (ModelFinder Plus)" to let the system automatically detect the best-fit model.

4. Input the Bootstrap Value. Enter the number of bootstrap replicates (default is 1000) to calculate the statistical support for the tree branches.

5. Click the "Run IQ-TREE & Download" button. Submit the form to start the Maximum Likelihood tree construction process.

FastTree

Reference: FastTree: computing large minimum evolution trees with profiles instead of a distance matrix

Download: FastTree

Installation:

conda install -c bioconda fasttree

Use Flow:

# Nucleotide alignment:
fasttree -nt nucleotide_alignment_file > tree_file

# Protein alignment:
fasttree protein_alignment_file > tree_file

FastTree usage

1. Select the Substitution Model. Choose the appropriate evolutionary model (e.g., JTT+CAT or GTR+CAT) from the dropdown menu for tree construction.

2. Upload the Comparison File. Click to upload the multiple sequence alignment file (usually in FASTA or .aln format) required for the analysis.

3. Click the "Build Phylogenetic Tree" button. Submit the form to execute FastTree and generate the phylogenetic tree structure.

SequenceServer

Reference: Sequenceserver: A Modern Graphical User Interface for Custom BLAST Databases

Installation:

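# 1. Install Ruby and dependencies (RubyGems ships with the ruby package)
sudo apt-get install ruby ruby-dev

# 2. Install SequenceServer gem
sudo gem install sequenceserver

# 3. Prepare Database (Example: copying a local database directory to the server)
# scp -r /local/dbindex user@ip:/home/db

# 4. Run SequenceServer (pointing to the database directory)
# -m creates BLAST databases from the FASTA files in the directory; -d sets the database directory.
sequenceserver -m -d /home/db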

SequenceServer usage

1. Enter the target sequence to be matched

2. Select the corresponding database

3. Enter the required parameters

4. Click the submit button

Synvisio

Description:

Synvisio is an interactive multiscale visualization tool for exploring MCScanX results. The server hosts a library of complete genomes, from which users can select two species for synteny analysis and view detailed synteny relationships between their chromosomes.

Synvisio usage

1. Select Species Database 1. Choose the reference species from the first dropdown menu (e.g., Arachis duranensis) to define the base genome for comparison.

2. Select Species Database 2. Choose the target species from the second dropdown menu to analyze collinearity blocks and synteny relationships against the first species.

3. Click the "Submit Analysis" button. Submit the form to initiate the Synvisio calculation and generate the visualization results.

NG_KaKs_Cal

Reference: LGRPv2: A high-value platform for the advancement of Fabaceae genomics

Download: NG-KaKs-Cal_setup_v1.0.zip

Description:

Ka/Ks is the ratio of the non-synonymous substitution rate (Ka) to the synonymous substitution rate (Ks) between two protein-coding genes. The ratio indicates the type of selective pressure acting on the gene: Ka/Ks > 1 suggests positive selection, Ka/Ks ≈ 1 neutral evolution, and Ka/Ks < 1 purifying selection.
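
The usual reading of the ratio (above 1 positive selection, near 1 neutral evolution, below 1 purifying selection) can be written as a small Python helper. This is a sketch for interpreting results, not part of NG_KaKs_Cal; the function name is hypothetical and the tolerance around 1 is an arbitrary assumption.

```python
def classify_selection(ka, ks, tolerance=0.05):
    """Return (Ka/Ks ratio, label) for one gene pair.

    The ratio is undefined when Ks is zero, which happens for pairs
    without synonymous differences.
    """
    if ks <= 0:
        return None, "undefined"
    ratio = ka / ks
    if ratio > 1 + tolerance:
        label = "positive"
    elif ratio < 1 - tolerance:
        label = "purifying"
    else:
        label = "neutral"
    return ratio, label


for ka, ks in [(0.02, 0.2), (0.3, 0.1), (0.1, 0.1), (0.1, 0.0)]:
    print(ka, ks, classify_selection(ka, ks))
```

Pairs with very small or saturated Ks values deserve extra caution, because the ratio becomes unstable in both cases.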

NG_KaKs_Cal usage

1. Upload the Synteny Block File. Click to upload the file containing the collinearity blocks or orthologous gene pairs for the species being analyzed (e.g., .txt or .block format).

2. Upload the First Species CDS File. Click to upload the coding sequence file for the first species (must be in FASTA or CDS format).

3. Upload the Second Species CDS File. Click to upload the coding sequence file for the second species (must be in FASTA or CDS format).

4. Click the "Run Calculation" button. Submit the form to initiate the pipeline and calculate the Ka/Ks ratios for the provided sequences.

BBK_Dotter

Reference: LGRPv2: A high-value platform for the advancement of Fabaceae genomics

Download: BBK-Dotter_setup_v1.0.zip

Description:

BBK_Dotter is a tool for generating genome structure lattice diagrams (dotplots) based on the results of Blast sequence alignment and block synteny. It labels Ks values next to the syntenic blocks.

Required Files:

BLAST file, Block file, Ks file, Species Lens files, and Species GFF files.

BBK_Dotter usage

1. Upload the BLAST file. Click to upload the sequence alignment result file (usually in tabular format) representing the homology search between the two species.

2. Upload the Block file. Click to upload the synteny block file (e.g., generated by MCScanX) that defines the collinear regions between the genomes.

3. Upload the Ks file. Click to upload the file containing the calculated Ks (synonymous substitution rate) values for the gene pairs.

4. Upload Species 1 Lens file. Click to upload the chromosome length file (Lens file) for the first species (Species 1) to define the genomic coordinates.

5. Upload Species 2 Lens file. Click to upload the chromosome length file (Lens file) for the second species (Species 2) to define the genomic coordinates.

6. Upload Species 1 GFF file. Click to upload the General Feature Format (GFF) file containing gene annotation information for the first species.

7. Upload Species 2 GFF file. Click to upload the General Feature Format (GFF) file containing gene annotation information for the second species.

8. Click the "Run Analysis" button. Submit the form to process the data and generate the Ks dotplot visualization.

GSDS

Reference: GSDS 2.0: an upgraded gene feature visualization server

Download: gsds_v2.tar.gz

Installation:

# a. Change to the path for installing GSDS 2.0 and unpack the tar package.
cd $PATH2INSTALL_GSDS
tar -zxvf gsds_v2.tar.gz

# b. Modify the permissions of the task directories and the log file
cd $PATH2INSTALL_GSDS/gsds_v2
mkdir task task/upload_file
chmod 777 task/
chmod 777 task/upload_file/
chmod a+w gsds_log

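# c. Link CGI commands in directory gcgi_bin to the commands in your system
# (each link needs a target; the lines below assume the four tools are already on your PATH)
cd $PATH2INSTALL_GSDS/gsds_v2/gcgi_bin
ln -s -f "$(command -v seqretsplit)" seqretsplit
ln -s -f "$(command -v est2genome)" est2genome
ln -s -f "$(command -v bedtools)" bedtools
ln -s -f "$(command -v rsvg-convert)" rsvg-convert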

# d. Configure Apache2 for accessing GSDS 2.0
# (Follow specific Apache configuration steps for your server environment)

Seq alignment

Description:

Performs multiple sequence alignment with a user-selected tool: MAFFT, MUSCLE, or ClustalW2.

Seq alignment usage

1. Upload the Protein File (PEP). Click to upload the file containing the protein sequences you want to align (must be in FASTA or PEP format).

2. Select the Multiple Sequence Alignment Tool. Choose the desired alignment software from the dropdown menu: mafft (known for speed), muscle (high accuracy), or clustalw2 (classic algorithm).

3. Click the "Start Alignment" button. Submit the form to run the selected aligner and produce the multiple sequence alignment.

Phyml

Reference: New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0

Download: PhyML-3.1.zip

Installation:

unzip PhyML-3.1.zip
mv PhyML-3.1 /opt/biosoft/
ln -s /opt/biosoft/PhyML-3.1/PhyML-3.1_linux64 /opt/biosoft/PhyML-3.1/PhyML
echo 'PATH=$PATH:/opt/biosoft/PhyML-3.1/' >> ~/.bashrc
source ~/.bashrc

Use Flow:

# Standard Command:
PhyML -i proteins.phy -d aa -b 1000 -m LG -f m -v e -a e -o tlr

# Parameter Explanation:
# -i : sequence file name (input)
# -d : data type (nt for nucleotide, aa for amino acid). Default: nt
# -b : bootstrap replicates (int)
# -m : substitution model (e.g., LG, GTR)
# -f : equilibrium frequencies (e = empirical, m = model-based, or explicit values fA,fC,fG,fT)
# -v : proportion of invariable sites (a fixed value, or e to estimate it)
# -a : gamma shape parameter (a fixed value, or e to estimate it)
# -o : parameters to optimize (t = topology, l = branch lengths, r = rate parameters)

Phyml usage

1. Upload Alignment File. Click to upload the sequence alignment file (supported formats: .aln, .phy) containing the aligned sequences for analysis.

2. Select Type of Data. Choose the data type of your sequences: aa (Amino Acid), nt (Nucleotide), or generic from the dropdown menu.

3. Select Model Category. Choose between "Nucleic Acid" or "Protein" in the first model selection box to filter the available evolutionary substitution models.

4. Select Substitution Model. Choose the specific evolutionary model (e.g., HKY85, GTR for nucleotides; LG, WAG for proteins) from the second dropdown menu.

5. Select Frequency Calculation. Choose the method for calculating equilibrium frequencies (e, m, or explicit values fA,fC,fG,fT) to estimate nucleotide or amino acid frequencies.

6. Select Tree Improvement Strategy. Choose the parameters to be optimized during tree search: Topology (t), Branch Lengths (l), and/or Rate Parameters (r).

7. Input Bootstrap Replicates. Enter the number of bootstrap replicates (default is 500) to assess the reliability of the phylogenetic tree branches.

8. Enter Email Address. Input a valid email address to receive the results and notifications once the analysis is complete.

9. Click the "Run Phyml" button. Submit the form to start the maximum likelihood analysis and tree construction process.

SS-extractor

Reference: LGRPv2: A high-value platform for the advancement of Fabaceae genomics

Download: SS-extractor_setup_v1.0.zip

Core Logic (Python):

Extracts specific Protein, CDS, and GFF records based on a provided ID list.

import os

from Bio import SeqIO

# Assumption: path_get is the platform's base data directory. It is not
# defined in this excerpt, so the current working directory is used here.
path_get = os.getcwd()


class sequence_run2(object):
    def __init__(self, place):
        self.id_list = []
        self.place = place

    def _path(self, name):
        # All input and output files live in the per-task directory.
        return f'{path_get}/file_keep/{self.place}/{name}'

    def _filter_fasta(self, source, target):
        # Copy only the FASTA records whose ID was requested.
        with open(self._path(target), 'w') as out:
            for record in SeqIO.parse(self._path(source), 'fasta'):
                if record.id in self.id_list:
                    out.write('>' + str(record.id) + '\n' + str(record.seq) + '\n')

    def new_prot(self):
        self._filter_fasta('protein.fasta', 'new_pro.txt')

    def new_cds(self):
        self._filter_fasta('gene.fasta', 'new_cds.txt')

    def new_gff(self):
        # Keep the GFF/table lines whose second column is a requested ID.
        with open(self._path('gff.fasta'), 'r') as gff_file, \
                open(self._path('new_gff.txt'), 'w') as out:
            for line in gff_file:
                fields = line.split()
                if len(fields) > 1 and fields[1] in self.id_list:
                    out.write(line)

    def main(self):
        # Read one ID per line, skipping blank lines.
        with open(self._path('id.fasta'), 'r') as id_file:
            for line in id_file:
                if line.strip():
                    self.id_list.append(line.split()[0])
        # Run each extraction once, after the full ID list has been loaded.
        self.new_prot()
        self.new_cds()
        self.new_gff()

SS-extractor usage

1.Upload the ID List File Click to upload a text file containing the list of specific IDs you want to extract (format: one ID per line).

2.Upload the Protein Sequence File Click to upload the protein sequence file in FASTA format (optional, but at least one data file is required).

3.Upload the Gene/CDS Sequence File Click to upload the coding sequence (CDS) file in FASTA format (optional).

4.Upload the GFF / Table File Click to upload the GFF or table file; the tool will extract lines where the second column matches your IDs (optional).

5.Click the "Filter & Download ZIP" button Submit the form to process the files and download the extracted records as a ZIP archive.
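
The extraction steps above can be sketched without BioPython as follows. This is a minimal illustration, not the tool's own code: the helper names (read_fasta, filter_fasta, filter_table) are hypothetical, and the record ID is assumed to be the first word after ">".

```python
from typing import Iterable, Iterator, List, Set, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (record_id, sequence) pairs; the ID is the first word after '>'."""
    record_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if record_id is not None:
                yield record_id, "".join(chunks)
            fields = line[1:].split()
            record_id, chunks = (fields[0] if fields else ""), []
        elif record_id is not None:
            chunks.append(line)
    if record_id is not None:
        yield record_id, "".join(chunks)


def filter_fasta(lines: Iterable[str], wanted: Set[str]) -> List[Tuple[str, str]]:
    """Keep only the FASTA records whose ID is in `wanted`."""
    return [(rid, seq) for rid, seq in read_fasta(lines) if rid in wanted]


def filter_table(lines: Iterable[str], wanted: Set[str], column: int = 1) -> List[str]:
    """Keep lines whose whitespace-separated field at `column` (0-based) is wanted."""
    kept = []
    for line in lines:
        fields = line.split()
        if len(fields) > column and fields[column] in wanted:
            kept.append(line)
    return kept
```

Holding the IDs in a set keeps each membership test constant-time, which matters when the ID list and the sequence files are both large.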

SAF-converter

Reference: LGRPv2: A high-value platform for the advancement of Fabaceae genomics

Download: SAF-converter_setup_v1.0.zip

Core Logic (Python):

Converts sequence file formats using BioPython's SeqIO.convert.

def format_fasta_run(place, file_name, patten, patten_end, file_last):
    SeqIO.convert(
        f"{path_get}/file_keep/{place}/{file_name}.{file_last}", 
        f"{patten}",
        f"{path_get}/file_keep/{place}/{file_name}.{patten_end}", 
        f"{patten_end}"
    )

SAF-converter usage

1.Upload the Sequence File Click to upload the biological sequence file containing the data you want to convert.

2.Select the Input Format Choose the specific format that matches your uploaded file (e.g., FASTA, GenBank, EMBL) from the first dropdown menu.

3.Select the Output Format Choose the desired target format you wish to convert your file into (e.g., Phylip, Nexus, Tabular) from the second dropdown menu.

4.Click the "Convert & Download" button Submit the form to initiate the conversion process and download the result.

BDI-deduplicator

Reference: LGRPv2: A high-value platform for the advancement of Fabaceae genomics

Download: BDI-deduplicator_setup_v1.0.zip

Core Logic (Python):

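Removes records with duplicate IDs, keeping the first record found for each ID.

from Bio import SeqIO

def quchong_run(place):
    # path_get (the base directory) is assumed to be defined elsewhere in the tool.
    seen_ids = set()  # IDs that have already been written
    with open(f'{path_get}/file_keep/{place}/new.fasta', 'w') as new_file:
        for record in SeqIO.parse(f'{path_get}/file_keep/{place}/1.fasta', 'fasta'):
            # Write only the first record seen for each ID
            if record.id not in seen_ids:
                seen_ids.add(record.id)
                new_file.write(">" + str(record.id) + "\n" + str(record.seq) + "\n")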

BDI-deduplicator usage

1.Upload the FASTA File Click to upload the sequence file containing the records you wish to deduplicate (supports .fasta, .fa, .faa, .fna, or .txt formats).

2.Click the "Deduplicate & Download" button Submit the form to process the file; the tool will filter out duplicate sequences based on their IDs and download the cleaned file.
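
As a rough illustration of ID-based deduplication, the sketch below works on FASTA text directly and needs no BioPython. The function name deduplicate_fasta is hypothetical. Unlike the core logic shown earlier, it keeps the original header lines and line wrapping rather than rewriting each record.

```python
def deduplicate_fasta(text: str) -> str:
    """Keep the first record for each ID; later records with the same ID are dropped."""
    seen = set()
    kept = []
    keep_current = False  # lines before the first header are ignored
    for line in text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            record_id = fields[0] if fields else ""
            keep_current = record_id not in seen
            seen.add(record_id)
        if keep_current and line.strip():
            kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")
```

Two records count as duplicates here when their IDs match, even if their descriptions or sequences differ.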

BDI-extractor

Reference: LGRPv2: A high-value platform for the advancement of Fabaceae genomics

Download: BDI-extractor_setup_v1.0.zip

Core Logic (Python/Shell):

Extracts one or more columns from a tab-delimited file (e.g., GFF) using the Linux `cut` command.

def extr_row_run(place, row_num, row_name, row_last):
    cmd = f"""
        cd {path_get}/file_keep/{place}
        cut -f {row_num} {row_name}.{row_last} > {row_name}.new.{row_last}
    """
    subprocess.run(cmd, shell=True, check=True)

BDI-extractor usage

1.Upload the GFF3 / Tabular File Click to upload the structured text file containing the data you want to extract (supports .gff3, .txt, .csv, or .tsv formats).

2.Enter the Columns to Extract Input the specific column numbers you wish to retrieve, separated by commas (e.g., enter "1,4,5" to extract the first, fourth, and fifth columns).

3.Click the "Extract & Download" button Submit the form to process the file and automatically download the new file containing only the selected columns.
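
The column selection can also be done in Python without calling the shell. The sketch below is an assumption-labelled equivalent, not the tool's code: parse_columns and extract_columns are hypothetical names, the input is assumed to be tab-delimited, and only one `cut` behaviour is mirrored (columns are emitted in file order, whatever order they were typed in).

```python
def parse_columns(spec: str):
    """Turn a string such as "1,4,5" into sorted, unique 1-based column numbers."""
    numbers = {int(part) for part in spec.split(",") if part.strip()}
    if any(n < 1 for n in numbers):
        raise ValueError("column numbers start at 1")
    return sorted(numbers)


def extract_columns(lines, spec: str, sep: str = "\t"):
    """Return one output line per input line, holding only the requested columns."""
    columns = parse_columns(spec)
    out = []
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        # Columns beyond the end of a short line are silently skipped
        out.append(sep.join(fields[n - 1] for n in columns if n <= len(fields)))
    return out
```

Validating the column list before use avoids passing free-form user input to a shell command.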

BDI-combiner

Core Logic (Python/Shell):

Merges multiple files from a directory into a single result file.

def file_merge_run_tools(place):
    cmd = f"""
        cd '{path_get}/file_keep/{place}'
        cat orthomcl/* >> result.txt
    """
    subprocess.run(cmd, shell=True, check=True)

BDI-combiner usage

1.Select Multiple Files to Merge Click to browse and select the multiple files you want to combine into a single file (hold Ctrl or Command key to select multiple items).

2.Filter by Extension (Optional) Input a file suffix (e.g., .fasta) if you want to specifically merge only the files with that extension from your selection.

3.Enter Output Filename (Optional) Input the desired name for the resulting merged file (default is merged_result.txt if left empty).

4.Click the "Combine & Download" button Submit the form to merge the selected files and download the consolidated result.
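
The merge step, including the optional extension filter, might look like the following sketch. The function name combine_files is hypothetical, and merging in sorted name order is an assumption. Note that this version overwrites the output file, whereas the `cat ... >>` command shown earlier appends to it.

```python
from pathlib import Path
from typing import Iterable, Optional


def combine_files(paths: Iterable[Path], output: Path, suffix: Optional[str] = None) -> int:
    """Concatenate the selected files in sorted order; return how many were merged."""
    selected = sorted(p for p in paths if suffix is None or p.name.endswith(suffix))
    with open(output, "wb") as merged:
        for path in selected:
            merged.write(path.read_bytes())
    return len(selected)
```

Files are copied as bytes, so a file that lacks a trailing newline will run into the first line of the next file.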

ShinyCircos

Reference: shinyCircos: an R/Shiny application for interactive creation of Circos plot

Download: shinyCircos-master.zip

Installation (R Packages):

install.packages("shiny")
install.packages("circlize")
install.packages("RColorBrewer")
install.packages("data.table")
install.packages("RLumShiny")

# Bioconductor packages (biocLite has been replaced by BiocManager)
install.packages("BiocManager")
BiocManager::install("GenomicRanges")

Server Configuration (Shiny Server):

# Define the user to spawn R Shiny processes
run_as shiny;

# Define a top-level server which will listen on a port
server {
  # Use port 3838
  listen 3838;
  
  # Define the location served at the /shinycircos URL
  location /shinycircos {
    # Directory containing the code and data of shinyCircos
    app_dir /srv/shiny-server/shinyCircos;
    # Directory to store the log files
    log_dir /var/log/shiny-server;
  }
}

B_Dotter

Reference: LGRPv2: A high-value platform for the advancement of Fabaceae genomics

Download: B_Dotter_setup_v1.0.zip

Use Flow:

# Run the Perl script
perl dotplot.pl

B_Dotter usage

1.Upload the BLAST Output file Click to upload the BLAST result file that contains the alignment information between the two species.

2.Upload the Species 1 Lens file Click to upload the chromosome length file for the first species (usually for the vertical axis).

3.Upload the Species 2 Lens file Click to upload the chromosome length file for the second species (usually for the horizontal axis).

4.Upload the Species 1 GFF file Click to upload the gene feature file (GFF) containing coordinate information for the first species.

5.Upload the Species 2 GFF file Click to upload the gene feature file (GFF) containing coordinate information for the second species.

6.Click the "Run Analysis" button Submit the form to process the uploaded files and generate the synteny dotplot.

BB_Dotter

Reference: LGRPv2: A high-value platform for the advancement of Fabaceae genomics

Download: BB_Dotter_setup_v1.0.zip

Use Flow:

# Run the Python script
python dotplot_block.2400.pd.py

BB_Dotter usage

1.Upload the BLAST Output file Click to upload the BLAST result file containing the alignment information between the two species.

2.Upload the Block file Click to upload the synteny block file (usually generated by synteny analysis tools like MCScanX, e.g., .rr.txt).

3.Upload the Species 1 Lens file Click to upload the chromosome length file for the first species (usually displayed on the vertical axis).

4.Upload the Species 2 Lens file Click to upload the chromosome length file for the second species (usually displayed on the horizontal axis).

5.Upload the Species 1 GFF file Click to upload the gene feature file (GFF) containing gene coordinate information for the first species.

6.Upload the Species 2 GFF file Click to upload the gene feature file (GFF) containing gene coordinate information for the second species.

7.Enter the Block Number Input the minimum number of gene pairs required to define a valid synteny block (default is 10).

8.Click the "Run Analysis" button Submit the form to process the data and generate the block dotplot visualization.

Paleo-gene_identifer

Reference: LGRPv2: A high-value platform for the advancement of Fabaceae genomics

Download: Paleo-gene_identifer_setup_v1.0.zip

Use Flow:

# Run the Python script
python corr_dotplot_spc_last.py

Paleo-gene_identifer usage

1.Upload the BLAST File Click to upload the BLAST result file containing the alignment information between the two species.

2.Upload the Block File Click to upload the synteny block file (e.g., .block.rr.txt) defining the collinear regions.

3.Upload the Correspondence File Click to upload the correspondence file that maps the relationship between the genomes (e.g., .corr.txt).

4.Enter Species 1 Name Input the short abbreviation for the first species (e.g., Aip) used in the analysis.

5.Upload Species 1 Lens File Click to upload the chromosome length file for the first species.

6.Upload Species 1 GFF File Click to upload the gene feature file (GFF) containing gene coordinates for the first species.

7.Enter Species 2 Name Input the short abbreviation for the second species (e.g., Adu) used in the analysis.

8.Upload Species 2 Lens File Click to upload the chromosome length file for the second species.

9.Upload Species 2 GFF File Click to upload the gene feature file (GFF) containing gene coordinates for the second species.

10.Click the "Run Analysis" button Submit the form to start the synteny inference and evolutionary analysis process.

Paleo-gene_RI

Reference: LGRPv2: A high-value platform for the advancement of Fabaceae genomics

Download: Paleo-gene_RI_setup_v1.0.zip

Use Flow:

# Run the Python script
python baoliu.py

Paleo-gene_RI usage

1.Upload the MC File Click to upload the Correspondence (MC) file containing the gene mapping information required for retention analysis.

2.Enter Email Address Input a valid email address to receive the analysis results once the processing is complete.

3.Click the "Submit Analysis" button Submit the form to start the gene retention calculation process.

Paleo-gene_RII

Reference: LGRPv2: A high-value platform for the advancement of Fabaceae genomics

Download: Paleo-gene_RII_setup_v1.0.zip

Use Flow:

# Run the Python script
python lost.py

Paleo-gene_RII usage

1.Upload the MC File Click to upload the Correspondence result file (usually a .txt file) containing the gene mapping information required to calculate gene loss.

2.Click the "Submit Analysis" button Submit the form to initiate the statistical calculation of paleo-gene loss.

WGDI

Reference: WGDI: A user-friendly toolkit for evolutionary analyses of whole-genome duplications and ancestral karyotypes

Documentation:

Detailed tutorials and step-by-step usage instructions are available in the official documentation:

WGDI ReadTheDocs

DupGen_finder

Reference: Gene duplication and evolution in recurring polyploidization-diploidization cycles in plants

Download: DupGen_finder-master.zip

Installation:

cd DupGen_finder
make
chmod 775 DupGen_finder.pl
chmod 775 DupGen_finder-unique.pl
chmod 775 set_PATH.sh
source set_PATH.sh

Gene Duplication Classification:

WGD: Whole-genome duplication
TD: Tandem duplication (two duplicated genes adjacent to each other)
PD: Proximal duplication (duplicated genes separated by no more than 10 genes)
TRD: Transposed duplication (an ancestral locus and a novel locus)
DSD: Dispersed duplication (neither adjacent nor collinear)
SL: Singleton (single-copy gene)
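
The distance rule that separates TD from PD can be illustrated with a small sketch. This is a simplification written for this page, not DupGen_finder's implementation: the function name is hypothetical, genes are identified by their order along a chromosome, and treating exactly 10 intervening genes as proximal is an assumption.

```python
def classify_by_distance(chrom_a: str, rank_a: int, chrom_b: str, rank_b: int,
                         max_proximal_gap: int = 10) -> str:
    """Classify a duplicated gene pair from gene order alone.

    rank_a and rank_b are the positions of the two genes in the ordered
    gene list of their chromosome.
    """
    if chrom_a != chrom_b:
        return "other"  # WGD, TRD, or DSD; not decidable from distance
    intervening = abs(rank_a - rank_b) - 1  # genes lying between the pair
    if intervening < 0:
        raise ValueError("a gene cannot be paired with itself")
    if intervening == 0:
        return "TD"
    if intervening <= max_proximal_gap:
        return "PD"
    return "other"
```

Pairs returned as "other" still need collinearity and outgroup information before they can be assigned to WGD, TRD, or DSD.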

Use Flow (Example):

# 1. Prepare GFF files (Spd: target species, Ath: outgroup species)
cat Spd.bed | sed 's/^/Spd-/g' | awk '{print $1"\t"$4"\t"$2"\t"$3}' > Spd.gff
cat Ath.bed | sed 's/^/Ath-Chr/g' | awk '{print $1"\t"$4"\t"$2"\t"$3}' > Ath.gff
sed -i 's/Chr0/Chr/g' Spd.gff
cat Spd.gff Ath.gff > Spd_Ath.gff

# 2. BLAST Search
# Build DB and Blast for Spd (Self)
makeblastdb -in Spd.pep -dbtype prot -title Spd -parse_seqids -out Spd
blastp -query Spd.pep -db Spd -evalue 1e-10 -max_target_seqs 5 -outfmt 6 -out Spd.blast

# Build the Ath (outgroup) database and blast Spd against it
makeblastdb -in Ath.pep -dbtype prot -title Ath -parse_seqids -out Ath
blastp -query Spd.pep -db Ath -evalue 1e-10 -max_target_seqs 5 -outfmt 6 -out Spd_Ath.blast

# Create the output directory
mkdir Spd_Ath

# 3. Run DupGen_finder
# General mode (-t: target, -c: control/outgroup)
DupGen_finder.pl -i $PWD -t Spd -c Ath -o ${PWD}/Spd_Ath/results1

# Strict mode
DupGen_finder-unique.pl -i $PWD -t Spd -c Ath -o ${PWD}/Spd_Ath/results2

Duplication_Type usage

1.Input the Query Sequence Paste your query sequence in FASTA format into the first text area to identify duplication types based on sequence similarity.

2.Input the Gene IDs Alternatively, if you already have the gene identifiers, paste the list of Gene IDs (e.g., Aann1g00011) into the second text area (skip Step 1 if using this option).

3.Select the Query Genome Choose the specific species from the dropdown menu (e.g., Artemisia annua, Vitis vinifera) to serve as the reference for duplication classification.

4.Click the "Search" button Submit the form to process the data and categorize the genes into duplication types (WGD, TD, PD, TRD, DSD).

GF_circos

Reference: LGRPv2: A high-value platform for the advancement of Fabaceae genomics

Download: GF_circos_setup_v1.0.zip

Description:

GF_circos is a visualization tool on the Asteraceae Platform that draws a Circos plot of gene pairs. It takes a Ks file, a gene pair file, a classification file, a GFF file, and a chromosome length (Lens) file as input, and is run from a configuration file.

Use Flow:

# Run the Python script with the configuration file
python run.py excircle -c excircle.conf

GF_circos usage

1.Upload the KS File Click to upload the file containing the Ks (synonymous substitution rate) values for the gene pairs.

2.Upload the Pair File Click to upload the file listing the specific gene pairs to be visualized.

3.Upload the Classify File Click to upload the classification file that categorizes the gene pairs (usually by duplication type or family).

4.Upload the GFF File Click to upload the Gene Feature File (GFF) containing the chromosomal coordinate information for the genes.

5.Upload the Lens File Click to upload the chromosome length file required to draw the backbone of the Circos plot.

6.Click the "Draw Circos" button Submit the form to process the uploaded data and generate the circular visualization.

Transcriptome Analysis

Workflow:

# I. Convert SRA data to fastq format using fasterq-dump
fasterq-dump --split-3 *.sra

# II. Use FastQC to evaluate the quality of the fastq files
fastqc *.fastq

# III. Trimming (remove adapters, trim low-quality bases, drop short reads)
# Paired-end (PE) sequencing:
trimmomatic PE -threads 4 -phred33 *_1.fastq.gz *_2.fastq.gz *_1_clean.fq *_1_unpaired.fq *_2_clean.fq *_2_unpaired.fq \
 ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:4:20 MINLEN:50 

# Single-end (SE) sequencing:
trimmomatic SE -threads 4 -phred33 *.fastq.gz *_clean.fq \
 ILLUMINACLIP:TruSeq3-SE.fa:2:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:4:20 MINLEN:50

# IV. Build an index of the reference genome
hisat2-build -p 3 *.fa *.index

# V. Align fastq sequences to the reference genome
# Paired-end (PE):
hisat2 --new-summary --rna-strandness RF -p 10 -x *.index -1 *_1_clean.fq -2 *_2_clean.fq -S *.sam
# Single-end (SE):
hisat2 --new-summary --rna-strandness R -p 10 -x *.index -U *_clean.fq -S *.sam 

# VI. Convert SAM to BAM and Sort
samtools sort -o *.bam *.sam

# VII. Quantitative analysis of gene expression
stringtie -e -A *.out -p 4 -G *.gtf -o *.gtf *.bam
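
The wildcards in the workflow above stand for one sample at a time. A small sketch of how the paired-end commands could be assembled per sample is given below. The function name and the file-naming pattern are assumptions made for illustration; the program options are copied from the commands above, and nothing is executed.

```python
def paired_end_commands(sample: str, index: str, annotation: str, threads: int = 4):
    """Return the per-sample command lines as argument lists (nothing is run)."""
    r1, r2 = f"{sample}_1_clean.fq", f"{sample}_2_clean.fq"
    trim = ["trimmomatic", "PE", "-threads", str(threads), "-phred33",
            f"{sample}_1.fastq.gz", f"{sample}_2.fastq.gz",
            r1, f"{sample}_1_unpaired.fq", r2, f"{sample}_2_unpaired.fq",
            "ILLUMINACLIP:TruSeq3-PE.fa:2:30:10", "LEADING:20", "TRAILING:20",
            "SLIDINGWINDOW:4:20", "MINLEN:50"]
    align = ["hisat2", "--new-summary", "--rna-strandness", "RF", "-p", str(threads),
             "-x", index, "-1", r1, "-2", r2, "-S", f"{sample}.sam"]
    sort = ["samtools", "sort", "-o", f"{sample}.bam", f"{sample}.sam"]
    quant = ["stringtie", "-e", "-A", f"{sample}.out", "-p", str(threads),
             "-G", annotation, "-o", f"{sample}.gtf", f"{sample}.bam"]
    return [trim, align, sort, quant]
```

Each returned list can be passed to subprocess.run(cmd, check=True) in order, so that a failed step stops the run for that sample.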

JBrowse2

Download: https://jbrowse.org/jb2/

Description:

JBrowse is a fast, scalable genome browser with a fully dynamic AJAX interface. It performs most work directly in the user's web browser, requiring minimal server resources.

Use Flow:

# 1. Process genome files (Index FASTA)
samtools faidx Psa.fasta

# 2. Process GFF3 files (Sort, Zip, and Index)
# Sort and tidy the GFF3 file
gt gff3 -sortlines -tidy -retainids Psa.gff > Psa.sorted.gff

# Compress with bgzip
bgzip Psa.sorted.gff

# Index with tabix
tabix Psa.sorted.gff.gz

Interproscan

Download: GitHub Repository

Installation:

# 1. Download and Extract
wget https://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.61-93.0/interproscan-5.61-93.0-64-bit.tar.gz
tar -pxvzf interproscan-5.61-93.0-*-bit.tar.gz
cd interproscan-5.61-93.0

# 2. Set Environment Variables
export PATH=`pwd`:$PATH

# * Note: Software relies on Java 11 or above
export PATH=/home/public/tools/jdk-17.0.1/bin:$PATH

Use Flow:

# A. For Protein Sequences
interproscan.sh -cpu 40 -d anno.dir -dp -i protein.fa

# B. For Nucleic Acid (Transcript) Sequences
interproscan.sh -cpu 40 -d anno.dir -dp -t n -i transcripts.fa

# Parameter Explanation:
# -cpu : Number of threads
# -d   : Output directory
# -dp  : Disable pre-calculated match lookup (force local calculation)
# -i   : Input sequence file
# -t   : Sequence type (n for nucleic acid, p for protein [default])

BDI-finder

Reference: LGRPv2: A high-value platform for the advancement of Fabaceae genomics

Download: BDI-finder_setup_v1.0.zip

Description:

The BDI-finder is a specialized calculation tool designed to streamline genome sequence data management on the Asteraceae Platform. Its primary function is to perform efficient "find and replace" operations within large sequence files. This utility is particularly essential for standardizing chromosome or scaffold nomenclature (e.g., converting "scaffold_01" to "Chr01") across datasets. Users simply upload their raw data file, specify the target character string to identify, and define the replacement string to generate a corrected output file instantly.

BDI-finder usage

1.Upload the Sequence File Click to upload the raw data file you wish to process (supports formats such as FASTA, TXT, or CSV).

2.Input the Find String (Old Name) Enter the character string or pattern that currently exists in the file and that you want to replace (e.g., "scaffold_").

3.Input the Replacement String (New Name) Enter the new character string that will replace the target text defined in the previous step (e.g., "Chr").

4.Click the "Run Calculation & Download" button Submit the form to execute the find-and-replace operation and initiate the download of the processed file.
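
A find-and-replace pass of this kind can be sketched in a few lines. The function name is hypothetical, and the sketch assumes a literal (non-regex) match applied to every line, headers and sequence lines alike; whether the tool limits itself to header lines is not stated here.

```python
def find_and_replace(lines, old: str, new: str):
    """Replace every literal occurrence of `old`; return the new lines and a count."""
    if not old:
        raise ValueError("the find string must not be empty")
    replaced = 0
    out = []
    for line in lines:
        replaced += line.count(old)
        out.append(line.replace(old, new))
    return out, replaced
```

Returning the number of replacements gives a quick check that the find string matched what was expected, for example one hit per chromosome.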

blast_match

Reference: LGRPv2: A high-value platform for the advancement of Fabaceae genomics

Download: blast_match_setup_v1.0.zip

Description:

Bl-match is a robust homology search and alignment utility designed to facilitate comparative genomic analysis on the platform. It enables researchers to perform BLAST (Basic Local Alignment Search Tool) operations against a curated repository of species-specific databases. To initiate a search, users must upload their query sequences in FASTA format and select the appropriate algorithm (blastp for proteins or blastn for nucleotides). The tool offers granular control over search sensitivity by allowing users to define critical filtering parameters, including E-value, Alignment Score, and Identity percentage, ensuring precise and relevant results.

blast_match usage

1.Upload the Target Database file Click the upload area to select and upload your query sequence file in FASTA format (e.g., .fasta).

2.Select the Blast Subroutine Choose the appropriate alignment algorithm from the dropdown menu: use "blastp" for protein sequences or "blastn" for nucleotide sequences.

3.Select the Species Database Choose the specific target organism from the provided species list to perform the alignment against.

4.Enter the E-value threshold Input the expectation value cutoff (default is 1e-5) to filter out statistically insignificant alignment results.

5.Input the Score Value Enter the minimum alignment score required for a match to be retained (default is 100).

6.Input the Identity Value Enter the minimum percentage of sequence identity required (default is 60) to define a valid match.

7.Click the "Run Blast Match" button Submit the form to initiate the homology search and alignment process.
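
The three thresholds can be applied to tabular BLAST output (-outfmt 6) as in the sketch below. It assumes that the "Score" is the bit score in column 12, that identity is column 3 and E-value is column 11, and that the thresholds are inclusive; the function name is hypothetical.

```python
def filter_hits(lines, max_evalue=1e-5, min_score=100.0, min_identity=60.0):
    """Keep the outfmt 6 lines that pass all three thresholds."""
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip comments and malformed lines
        identity = float(fields[2])
        evalue = float(fields[10])
        score = float(fields[11])
        if evalue <= max_evalue and score >= min_score and identity >= min_identity:
            kept.append(line)
    return kept
```

The default values mirror the defaults listed in the steps above.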

block_number_count

Reference: LGRPv2: A high-value platform for the advancement of Fabaceae genomics

Download: block_number_count_setup_v1.0.zip

Description:

The Block number count tool is a specialized analytical utility designed to process sequence alignment data on the Asteraceae Platform. Its primary function is to perform statistical analysis on collinear blocks to evaluate genome synteny. This utility supports batch processing, allowing researchers to upload multiple block files (in .blast or .txt formats) simultaneously. Users simply upload their collinear block datasets to generate comprehensive statistical reports on block numbers instantly.

Block_number_count usage

1.Upload the Collinear Block files Click to select and upload the data files you want to analyze (supports .blast and .txt formats); you can select multiple files at once.

2.Click the "Submit Analysis" button Submit the form to execute the block number statistical calculation and initiate the result download.

Chromosome_length_extractor

Reference: LGRPv2: A high-value platform for the advancement of Fabaceae genomics

Download: Chromosome_length_extractor_setup_v1.0.zip

Description:

The Seglenth-calculator is a specialized statistical tool designed to analyze nucleic acid sequences on the Asteraceae Platform. Its primary function is to calculate the precise length of individual sequences, making it essential for generating statistics on chromosomes or gene models (CDS). Users simply upload a file in FASTA format containing their sequences. The tool processes the data and automatically generates a comprehensive result table listing the specific ID and calculated length for each entry, facilitating quick assessment of genomic structure.

Chromosome_length_extractor usage

1.Upload the Sequence File Click to upload the nucleic acid sequence file (e.g., CDS or Chromosome sequences) you want to analyze (must be in FASTA format).

2.Click the "Submit Analysis" button Submit the form to calculate the length of each sequence and generate the statistics table.

3.Download the Result Table (Optional) Once the calculation is complete and the table is displayed, click this button to save the ID and Length data as a text file.
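
Computing the ID and length table is straightforward; a minimal sketch follows. The function name is hypothetical, and the ID is assumed to be the first word of each header line.

```python
def sequence_lengths(lines):
    """Return [(record_id, length)] in file order for FASTA input lines."""
    lengths = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            fields = line[1:].split()
            lengths.append([fields[0] if fields else "", 0])
        elif line and lengths:
            lengths[-1][1] += len(line)  # sequences may be wrapped over many lines
    return [(record_id, n) for record_id, n in lengths]
```

Writing each pair as "ID<TAB>length" gives a two-column table of the kind used as a chromosome length (Lens) file by the dotplot tools, although the exact format they expect is not specified here.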

Collinearity_generator

Reference: LGRPv2: A high-value platform for the advancement of Fabaceae genomics

Download: Collinearity_generator_setup_v1.0.zip

Description:

The Syn-deco Generator is a specialized bioinformatics tool designed for Synteny Inference and Evolutionary Analysis on the Asteraceae Platform. Its primary function is to construct comprehensive multi-group genome lists based on Correspondence results (MC files) derived from previous analysis steps. This utility is essential for studying paleogenome fractionation, allowing researchers to model the reference genome as either duplicated (1:2) or triplicated (1:3). Users are required to upload their MC result files (typically in a compressed format like .zip), select the appropriate ploidy pattern, and specify the reference filename and species ID to generate structured output for downstream evolutionary studies.

Collinearity_generator usage

1.Select the Reference Genome Ploidy Choose the appropriate ploidy ratio (Duplicated 1:2 or Triplicated 1:3) from the dropdown menu to define the structure of the reference genome.

2.Upload the MC Results files Click to upload the correspondence result files generated from previous analysis steps (supports multiple files or compressed archives).

3.Enter the Reference File Name Input the specific filename of the reference species' self-alignment result (e.g., "Vvi_Vvi.mc.txt").

4.Enter the Reference Species Name Input the abbreviation or identifier for the reference species (e.g., "Vvi").

5.Click the "Generate List" button Submit the form to construct the multi-group genome list and initiate the evolutionary analysis process.

Deal_stars

Reference: LGRPv2: A high-value platform for the advancement of Fabaceae genomics

Download: Deal_stars_setup_v1.0.zip

Description:

Deal_stars is a sequence processing tool designed to improve data compatibility on the Asteraceae Platform. It removes non-standard characters, specifically asterisks (*) and dots (.), from nucleotide or protein sequences. This is useful for cleaning raw data that contains stop codon markers or alignment gaps, which may interfere with downstream analysis tools. Users upload a FASTA-formatted file, and the tool generates a cleaned output file in which the sequence lines contain only sequence characters and the original headers are left unchanged.

Deal_stars usage

1.Upload the Sequence File Click to upload the FASTA format file (supports .fasta, .fa, .txt, etc.) containing the sequences from which you want to remove asterisks (*) and dots (.).

2.Click the "Clean & Download" button Submit the form to process the file, stripping out the unwanted characters while preserving headers, and initiate the download of the cleaned file.
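
The cleaning rule (strip "*" and "." from sequence lines, leave headers alone) can be sketched as follows; the function name is hypothetical.

```python
def strip_stars_and_dots(lines):
    """Remove '*' and '.' from sequence lines; header lines pass through unchanged."""
    table = str.maketrans("", "", "*.")
    return [line if line.startswith(">") else line.translate(table) for line in lines]
```

Skipping header lines matters because gene IDs often contain dots (for example a transcript suffix such as ".1"), which must not be removed.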

Draw_list_trees

Reference: LGRPv2: A high-value platform for the advancement of Fabaceae genomics

Download: Draw_list_trees_setup_v1.0.zip

Description:

The Multi-syn-phytre is a high-throughput phylogenetic analysis tool designed to automate Synteny Inference and Evolutionary Tree Formation on the Asteraceae Platform. Its primary function is to perform batch construction of phylogenetic trees based on orthogroups identified in collinearity studies. This utility significantly streamlines evolutionary analysis by integrating protein sequence (PEP) files with a collinearity list file to generate trees for multiple gene groups simultaneously. Users can upload their sequence data, provide the corresponding list, and optionally configure root node settings to efficiently reconstruct the evolutionary history of syntenic blocks.

Draw_list_trees usage

1.Set the Root Node option Select "True" or "False" from the dropdown menu to determine whether the generated phylogenetic trees should be rooted.

2.Upload the Protein Sequence (PEP) files Click to upload the protein sequence files for the relevant species (multiple files can be selected simultaneously).

3.Upload the Collinearity List file Click to upload the specific text file containing the orthogroup or collinearity list constructed in the previous analysis.

4.Click the "Construct Trees" button Submit the form to automate the batch construction of phylogenetic trees based on the uploaded inputs.

GC-calculation

Reference: LGRPv2: A high-value platform for the advancement of Fabaceae genomics

Download: GC-calculation_setup_v1.0.zip

Description:

The GC Content Calculator is a specialized analytical tool on the Asteraceae Platform designed to compute detailed base composition statistics for nucleic acid sequences. Its primary function is to determine the precise length and specific nucleotide distribution (A, T, C, G) for individual genes or genomic fragments. This utility is essential for assessing genomic stability and characterizing coding sequences (CDS). Users need only upload a nucleic acid file in FASTA format to automatically generate a comprehensive results table that quantifies the GC percentage, AT percentage, and individual base counts for every sequence ID.

GC-calculation usage

1.Upload the FASTA File Click to upload the nucleic acid sequence file (e.g., CDS sequences) you want to analyze (must be in FASTA format).

2.Click the "Start Calculation" button Submit the form to calculate the base composition (A, T, C, G) and generate the statistics table.

3.Download the Result Table (Optional) Once the calculation is complete and the table is displayed, click this button to save the detailed statistical data as a text file.
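
For a single sequence, the statistics in the result table could be computed as in this sketch. The function name is hypothetical, and expressing the percentages relative to the full sequence length (including ambiguous bases such as N) is an assumption.

```python
def base_composition(sequence: str) -> dict:
    """Count A, T, C, G and report GC% and AT% relative to the sequence length."""
    seq = sequence.upper()
    counts = {base: seq.count(base) for base in "ATCG"}
    length = len(seq)
    gc = counts["G"] + counts["C"]
    at = counts["A"] + counts["T"]
    counts["length"] = length
    counts["GC%"] = 100.0 * gc / length if length else 0.0
    counts["AT%"] = 100.0 * at / length if length else 0.0
    return counts
```

With this definition GC% and AT% add up to less than 100 whenever the sequence contains bases other than A, T, C, and G.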

Gene_cds_exon_counter

Reference: LGRPv2: A high-value platform for the advancement of Fabaceae genomics

Download: Gene_cds_exon_counter_setup_v1.0.zip

Description:

The Gerstr-counter is a comprehensive statistical analysis tool designed to characterize genomic structures on the Asteraceae Platform. Its primary function is to quantify key genomic elements by parsing annotation data. It automatically calculates the count and cumulative length of essential features, including Genes, CDS (Coding DNA Sequences), Exons, Introns, UTRs, and mRNAs. Users are required to upload one or multiple GFF (General Feature Format) files to initiate the analysis. The tool generates a detailed statistical report, providing researchers with immediate insights into genome composition and annotation metrics.

gene_cds_exon_counter usage

1.Upload the GFF annotation files Click to upload the gene feature files you want to analyze (supports GFF format, and multiple files can be uploaded simultaneously).

2.Click the "Submit & Calculate" button Submit the form to calculate the statistics for genes, exons, introns, and CDS, and to generate the results table.

3.Download the Result Table (Optional) Once the calculation is complete and the table is displayed, click this button to save the detailed statistical metrics as a text file.
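
Counting features from a GFF file comes down to grouping lines by the feature type in column 3 and summing their spans. The sketch below shows that idea only; the function name is hypothetical, and it counts just the feature types that appear explicitly in the file, so introns or UTRs that are not annotated as their own lines are not derived.

```python
from collections import defaultdict


def feature_stats(lines):
    """Return {feature_type: (count, cumulative_length)} from GFF lines."""
    stats = defaultdict(lambda: [0, 0])
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments, and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue
        start, end = int(fields[3]), int(fields[4])
        stats[fields[2]][0] += 1
        stats[fields[2]][1] += end - start + 1  # GFF coordinates are 1-based, inclusive
    return {ftype: tuple(values) for ftype, values in stats.items()}
```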

Gene_structure

Reference: LGRPv2: A high-value platform for the advancement of Fabaceae genomics

Download: Gene_structure_setup_v1.0.zip

Description:

Gmapping is a specialized visualization utility designed to generate detailed gene structure diagrams on the Asteraceae Platform. Its primary function is to map and display genomic features within a user-defined chromosomal region. By integrating data from an uploaded GFF file and a specific highlight file (containing target gene IDs), the tool visualizes gene arrangements based on precise inputs for the chromosome number, starting position, and termination position. This allows researchers to intuitively analyze gene locations and structural relationships within a specific genomic window.

gene_structure usage

1.Upload the GFF File Click to upload the gene annotation file containing structural information (must be in GFF format).

2.Upload the Highlight File Click to upload the file containing the list of specific gene IDs you wish to highlight in the diagram.

3.Input the Chromosome Position Enter the specific chromosome number or identifier where the target genes are located (e.g., "1").

4.Input the Starting Position Enter the numeric start coordinate on the chromosome to define the beginning of the drawing range (e.g., "1200").

5.Input the Termination Position Enter the numeric end coordinate on the chromosome to define the end of the drawing range (e.g., "162713").

6.Click the "Draw Structure" button Submit the form to generate and visualize the gene structure diagram based on the input parameters.

Genome_Mount_Rate_Calculator

Reference: LGRPv2: A high-value platform for the advancement of Fabaceae genomics

Download: Genome_Mount_Rate_Calculator_setup_v1.0.zip

Description:

The Gam-Calculator is a specialized analytical tool designed to assess genome assembly quality on the Asteraceae Platform. Its primary function is to compute the "Genome Mount Rate," a critical metric for evaluating the completeness of chromosomal assembly. Users simply upload a genomic sequence file (FASTA format), specify the expected number of chromosomes, and define a length threshold to exclude smaller fragments. The tool processes these parameters to calculate the proportion of the genome successfully mounted to chromosomes, generating a precise statistical report instantly.

Genome_Mount_Rate_Calculator usage

1.Upload the Genomic File Click to upload the genome sequence file you wish to analyze (must be in FASTA format and less than 200MB).

2.Input the Number of Chromosomes Enter the total count of chromosomes expected in the genome (default is 12).

3.Input the Exclusion Length Threshold Enter the minimum sequence length value; sequences shorter than this threshold (e.g., small scaffolds or contigs) will be excluded from the calculation (default is 2000).

4.Click the "Calculate" button Submit the form to perform the genome mount rate calculation and generate the results table.

5.Download the Result Table (Optional) Once the calculation is complete and the table is displayed, click this button to save the genome mount rate statistics as a text file.
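
The calculation can be sketched in a few lines of Python. The exact formula used by the platform is not given here, so this sketch assumes one plausible definition: after dropping sequences shorter than the threshold, the mount rate is the combined length of the N longest sequences (taken to be the chromosomes) divided by the combined length of all remaining sequences. The function names and example sequences are hypothetical.

```python
def read_fasta_lengths(lines):
    """Map each FASTA record name to its sequence length."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def mount_rate(lengths, n_chromosomes=12, min_length=2000):
    """Assumed definition: length of the N longest sequences / total kept length."""
    kept = sorted((n for n in lengths.values() if n >= min_length), reverse=True)
    total = sum(kept)
    if total == 0:
        return 0.0
    return sum(kept[:n_chromosomes]) / total


# Made-up toy genome: two "chromosomes", one scaffold, one fragment below the threshold.
fasta = [">chr1", "A" * 6000, ">chr2", "C" * 3000, ">scaffold1", "G" * 1000, ">tiny", "T" * 500]
lengths = read_fasta_lengths(fasta)
rate = mount_rate(lengths, n_chromosomes=2, min_length=800)
```

The defaults for `n_chromosomes` and `min_length` mirror the form defaults listed above (12 and 2000).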

Heatmap

Reference: LGRPv2: A high-value platform for the advancement of Fabaceae genomics

Download: Heatmap_setup_v1.0.zip

Description:

The Heatmap tool is a sophisticated visualization utility designed to analyze and interpret gene expression patterns on the Asteraceae Platform. It transforms complex target datasets into intuitive graphical representations, enabling researchers to identify expression dynamics and correlations at a glance. Users simply upload their data file and can fully customize the analysis by toggling Log2 transformation and selecting specific row or column clustering modes. To meet publication or presentation standards, the tool also offers flexible configuration for image dimensions and a personalized three-color gradient scheme, generating a high-resolution heatmap image instantly.

Heatmap usage

1.Upload the Target Database file Click to upload the gene expression data file (labeled as FASTA format in the interface).

2.Configure Log2 Transformation Check the box to apply a Log2 transformation to your data values (default is checked).

3.Set Column Clustering Check the box to enable hierarchical clustering for the data columns (default is checked).

4.Set Row Clustering Check the box to enable hierarchical clustering for the data rows (default is checked).

5.Select the First Color Choose a color from the picker or enter a hex code for the first part of the gradient (default is #D41B03).

6.Select the Second Color Choose a color from the picker or enter a hex code for the middle part of the gradient (default is #FEF9FA).

7.Select the Third Color Choose a color from the picker or enter a hex code for the last part of the gradient (default is #3229D0).

8.Enter Image Width Input the desired numerical width for the generated heatmap image (default is 8).

9.Enter Image Height Input the desired numerical height for the generated heatmap image (default is 10).

10.Click the "Draw Heatmap" button Submit the form to generate and visualize the gene expression heatmap.
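
Two of the options above, the Log2 transformation and the three-color gradient, can be illustrated with a short Python sketch. This is not the tool's code: the pseudocount of 1, the linear interpolation in RGB space, and the choice of which end of the gradient corresponds to low or high values are assumptions made for illustration.

```python
import math


def log2_transform(value, pseudocount=1.0):
    """Log2-transform an expression value; the pseudocount (assumed) avoids log2(0)."""
    return math.log2(value + pseudocount)


def hex_to_rgb(code):
    """Convert '#RRGGBB' to an (r, g, b) tuple of integers."""
    code = code.lstrip("#")
    return tuple(int(code[i:i + 2], 16) for i in (0, 2, 4))


def three_color(t, first="#D41B03", middle="#FEF9FA", last="#3229D0"):
    """Map t in [0, 1] onto the gradient: 0 -> first, 0.5 -> middle, 1 -> last."""
    t = min(max(t, 0.0), 1.0)
    if t <= 0.5:
        a, b, frac = hex_to_rgb(first), hex_to_rgb(middle), t / 0.5
    else:
        a, b, frac = hex_to_rgb(middle), hex_to_rgb(last), (t - 0.5) / 0.5
    rgb = [round(x + (y - x) * frac) for x, y in zip(a, b)]
    return "#{:02X}{:02X}{:02X}".format(*rgb)


mid_color = three_color(0.5)
```

The default hex codes are the form defaults listed in steps 5 to 7.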

Lollipop

Reference: LGRPv2: A high-value platform for the advancement of Fabaceae genomics

Download: Lollipop_setup_v1.0.zip

Description:

The Lollipop Chart Generator is a sophisticated visualization tool designed for the Asteraceae Platform to assist in phylogenetic and gene expression analysis. By rendering data as "lollipops"—consisting of a marker and a connecting line—this utility offers a cleaner, more readable alternative to traditional bar charts, particularly when displaying long lists of species or genes. Users are required to upload a tab-delimited text file containing three specific columns: "Species" (identifier), "Value" (numerical data), and "Group" (taxonomic or functional classification). The tool automatically maps these groups to distinct color palettes and supports custom sorting and styling, allowing researchers to intuitively compare quantitative metrics across different biological categories.

Lollipop usage

1.Upload the Data File Click to upload the text-based data file containing your dataset (expected format includes columns for Species, Value, and Group).

2.Select the Sort Order Choose the desired sorting method for the data values from the dropdown menu (e.g., Descending or Ascending).

3.Select the Base Color Click the color picker interface to define the primary color used for the data points (dots).

4.Adjust the Dot Size Drag the slider to set the specific pixel diameter for the data points (ranging from 5 to 30 pixels).

5.Click the "Plot Lollipop" button Submit the configured form to generate and render the interactive Lollipop Chart.
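
Before uploading, it can help to confirm that the data file parses as expected. The sketch below reads a tab-delimited table with the `Species`, `Value`, and `Group` columns described above and applies the chosen sort order. The function name `load_lollipop_rows` and the sample values are made up for illustration.

```python
import csv
import io


def load_lollipop_rows(text, descending=True):
    """Parse tab-delimited text with Species/Value/Group columns and sort by Value."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = [
        {"Species": r["Species"], "Value": float(r["Value"]), "Group": r["Group"]}
        for r in reader
    ]
    rows.sort(key=lambda r: r["Value"], reverse=descending)
    return rows


# Placeholder data, not real measurements.
sample = (
    "Species\tValue\tGroup\n"
    "species_A\t3.5\tgroup_1\n"
    "species_B\t2.0\tgroup_2\n"
    "species_C\t1.5\tgroup_1\n"
)
rows = load_lollipop_rows(sample, descending=False)
```

A missing column raises `KeyError` and a non-numeric value raises `ValueError`, which points directly at the formatting problem.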

P-index-calculator

Reference: LGRPv2: A high-value platform for the advancement of Fabaceae genomics

Download: P-index-calculator_setup_v1.0.zip

Description:

The P-Index Calculator is a specialized analytical utility developed for the Asteraceae Platform to quantify subgenome evolutionary dynamics. Its primary function is to compute the P-index, a metric used to evaluate evolutionary trends such as gene retention bias and fractionation patterns following polyploidy events. To initiate the analysis, users must upload a "Correspondence result" file (MC file) containing syntenic gene pairs. The tool provides a highly customizable environment where researchers can configure specific parameters, including Sliding Window size, Step size, and Sigma range (0-2), to fine-tune the granularity of the calculation. This allows for the precise visualization of evolutionary dominance or sensitivity across chromosomal regions.

P-index usage

1.Upload the MC File Click to upload the correspondence result file containing collinearity data (e.g., a text file generated from previous correspondence analysis).

2.Enter the Sliding Window Input the window size for the analysis (default is 100) to define the range of genes considered in each step.

3.Enter the Step Size Input the numerical increment for the moving window (default is 2) to control the granularity of the analysis.

4.Input the Sigma Start Value Enter the starting threshold for the sigma parameter (between 0 and 2, default is 0.05).

5.Input the Sigma End Value Enter the ending threshold for the sigma parameter (between 0 and 2, default is 1.8).

6.Click the "Run Analysis" button Submit the form to start the P-Index calculation and generate the evolutionary trend analysis.
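
The interaction between the Sliding Window and Step Size settings can be sketched as follows. This Python example only enumerates the gene index ranges that a moving window would cover; it does not compute the P-index itself, whose formula is not given here. The function name `sliding_windows` and the choice to keep only full-length windows are assumptions.

```python
def sliding_windows(n_genes, window=100, step=2):
    """Yield (start, end) gene index pairs, end exclusive, for each full window."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    for start in range(0, n_genes - window + 1, step):
        yield start, start + window


# A small example: 10 genes, window of 4 genes, step of 2 genes.
windows = list(sliding_windows(10, window=4, step=2))
```

With the form defaults (window 100, step 2), consecutive windows share 98 genes, so a smaller step gives a smoother but more expensive profile.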

Sequence_fetch

Reference: LGRPv2: A high-value platform for the advancement of Fabaceae genomics

Download: Sequence_fetch_setup_v1.0.zip

Description:

Seq-fetch is a dedicated sequence retrieval tool designed to streamline the extraction of genetic data on the platform. Its primary function is to perform efficient batch searches for Coding Sequences (CDS) and Peptide (PEP) sequences based on specific gene identifiers. To initiate a search, users simply select the target species from the database and input a list of Gene IDs (e.g., "Aed09g01638") into the query field. The tool instantly retrieves the corresponding sequences along with related metadata, providing a centralized interface for researchers to view, copy, and download the filtered datasets for downstream analysis.

Sequence_Fetch usage

1.Select the Target Species Choose the specific species code (e.g., Arachis duranensis - Adu) from the dropdown menu to define the source genome for sequence retrieval.

2.Input the Gene ID List Enter the specific gene identifiers (one per line, e.g., Adu01g03207) into the text area, or click the "Example" button to load sample data.

3.Click the "Submit Analysis" button Submit the form to fetch the CDS and protein sequences corresponding to the entered IDs.
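
The same kind of batch lookup can be reproduced locally when CDS and peptide FASTA files are at hand. The following is a minimal sketch rather than the platform's implementation; `parse_fasta`, `fetch_sequences`, and the gene IDs and sequences in the example are hypothetical.

```python
def parse_fasta(lines):
    """Return {record_id: sequence}, using the first word of each header as the ID."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(chunks) for key, chunks in records.items()}


def fetch_sequences(gene_ids, cds, pep):
    """Look up each requested ID in both tables; report IDs found in neither."""
    found, missing = {}, []
    for gene_id in gene_ids:
        gene_id = gene_id.strip()
        if not gene_id:
            continue  # ignore blank lines in the ID list
        if gene_id in cds or gene_id in pep:
            found[gene_id] = {"CDS": cds.get(gene_id), "PEP": pep.get(gene_id)}
        else:
            missing.append(gene_id)
    return found, missing


# Made-up records for illustration.
cds = parse_fasta([">gene1 demo", "ATGGCC", "TAA", ">gene2", "ATGAAATAA"])
pep = parse_fasta([">gene1", "MA", ">gene2", "MK"])
found, missing = fetch_sequences(["gene1", "", "gene9"], cds, pep)
```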

Species_classification

Reference: LGRPv2: A high-value platform for the advancement of Fabaceae genomics

Download: Species_classification_setup_v1.0.zip

Description:

The Sptaxon-accessor is a specialized auxiliary tool designed to facilitate the automated batch retrieval of taxonomic classification data. Leveraging the comprehensive NCBI database, this utility allows users to quickly obtain detailed taxonomic hierarchies based on species' Latin names. Users are simply required to upload a species list file containing the scientific names of interest, and the tool processes the input to generate specific classification information for each entry. This feature is particularly valuable for researchers performing large-scale phylogenetic analyses or biodiversity assessments on the Asteraceae Platform, eliminating the need for manual, single-species lookups.

Species_classification usage

1.Upload the Species List file Click to upload the text file containing the Latin names of the species you want to classify (one species name per line).

2.Click the "Get Classification" button Submit the form to retrieve the taxonomic classification information from NCBI in batches.

Subtree_extractor

Reference: LGRPv2: A high-value platform for the advancement of Fabaceae genomics

Download: Subtree_extractor_setup_v1.0.zip

Description:

The Subtree_extractor is a specialized phylogenetic utility designed to simplify the analysis of complex evolutionary relationships by isolating relevant data. Its primary function is to "prune" large phylogenetic trees (whether species trees or gene trees) to extract specific sub-phylogenies based on user-defined criteria. This tool is particularly useful for researchers who need to focus on a distinct subset of taxa within a massive evolutionary tree without manually editing complex Newick files. Users are required to upload two inputs: a source phylogenetic tree file and a list containing the target species names. The tool then processes these files to generate a refined subtree that exclusively retains the specified lineages, facilitating focused evolutionary study.

Subtree_Extractor usage

1.Upload the Phylogenetic Tree file Click to upload the master tree file (usually in Newick format) containing the complete evolutionary relationships.

2.Upload the Species List file Click to upload the text file containing the specific list of species or gene names you want to extract from the main tree.

3.Click the "Extract Subtrees" button Submit the form to isolate and generate the smaller phylogenetic tree based on your provided list.
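
Pruning a Newick tree to a list of tips can be sketched in plain Python. This is a simplified illustration, not the tool's algorithm: the parser handles only unquoted labels without comments, and collapsing single-child internal nodes by summing their branch lengths is an assumption about how the pruned tree should look. All function names are hypothetical.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dicts."""
    text = text.strip().rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if pos < len(text) and text[pos] == "(":
            pos += 1  # consume "("
            while True:
                children.append(parse_node())
                if text[pos] == ",":
                    pos += 1
                    continue
                pos += 1  # consume ")"
                break
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        label, _, length = text[start:pos].partition(":")
        return {
            "name": label.strip(),
            "length": float(length) if length else None,
            "children": children,
        }

    return parse_node()


def prune(node, keep):
    """Keep only tips named in `keep`; collapse internal nodes left with one child."""
    if not node["children"]:
        return node if node["name"] in keep else None
    kept = [c for c in (prune(child, keep) for child in node["children"]) if c is not None]
    if not kept:
        return None
    if len(kept) == 1:
        child = kept[0]
        if node["length"] is not None or child["length"] is not None:
            merged = (node["length"] or 0.0) + (child["length"] or 0.0)
            child = dict(child, length=merged)
        return child
    return dict(node, children=kept)


def to_newick(node):
    """Serialise the nested dicts back to Newick (without the trailing semicolon)."""
    text = ""
    if node["children"]:
        text = "(" + ",".join(to_newick(c) for c in node["children"]) + ")"
    text += node["name"]
    if node["length"] is not None:
        text += ":" + format(node["length"], "g")
    return text


tree = parse_newick("((A:1,B:2):0.5,(C:1,D:1):0.5);")
subtree = to_newick(prune(tree, {"A", "C", "D"})) + ";"
```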

JCVI

Reference: JCVI: A versatile toolkit for comparative genomics analysis

Download: jcvi-main.zip

Installation:

# Create a virtual environment named “jcvi” using conda
conda create -y -c bioconda -c conda-forge -n jcvi python=3.10

# Activate the Environment
conda activate jcvi

# Install the LAST aligner (provides lastdb and lastal)
conda install -c bioconda last

# Install JCVI
pip install jcvi

# Verify Installation
jcvi --version

Use Flow:

python -m jcvi.compara.catalog ortholog species1 species2
python -m jcvi.graphics.dotplot species1.species2.anchors
python -m jcvi.compara.synteny screen --minspan=30 --simple species1.species2.anchors species1.species2.anchors.new
python -m jcvi.graphics.karyotype seqids layout
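
The `.anchors` files produced by the ortholog step are plain text, so they are easy to inspect before plotting. The sketch below assumes the usual layout, in which syntenic blocks are separated by lines starting with `#` and each anchor line lists two gene IDs followed by a score. The function name `read_anchor_blocks` and the gene names are hypothetical, and counting anchors per block is not the same measure as the `--minspan` filter.

```python
def read_anchor_blocks(lines):
    """Split .anchors content into blocks of (gene_a, gene_b, score) tuples."""
    blocks, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # A "#" line closes the previous block, if any.
            if current:
                blocks.append(current)
                current = []
            continue
        fields = line.split()
        if len(fields) >= 3:
            current.append((fields[0], fields[1], fields[2]))
    if current:
        blocks.append(current)
    return blocks


# Made-up anchors for illustration.
example = [
    "###",
    "sp1_g001\tsp2_g101\t850",
    "sp1_g002\tsp2_g102\t920",
    "###",
    "sp1_g050\tsp2_g300\t400",
]
blocks = read_anchor_blocks(example)
block_sizes = [len(block) for block in blocks]
```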

Common error sets:

# Error 1: Command ‘lastdb/lastal’ not found
# Solve 1: conda install -c bioconda last

# Error 2: LaTeX/dvipng missing (plotting error)
# Solve 2: sudo apt-get install texlive texlive-latex-extra texlive-fonts-recommended dvipng

RAxML

Reference: RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies

Download: standard-RAxML-master.zip

Installation:

# Go to Alexis Stamatakis's GitHub repository (stamatakis/standard-RAxML) and download the latest RAxML version. When the download is complete, type:
unzip standard-RAxML-master.zip

# This will create a directory called standard-RAxML-master
# Change into this directory by typing
cd standard-RAxML-master

# Initially, we will need to compile the RAxML executables.
# To obtain the sequential version of RAxML type:
make -f Makefile.gcc

# This generates an executable called raxmlHPC. To compile additional versions of RAxML, always run rm *.o before using another Makefile. For example, on a laptop with two cores:
make -f Makefile.gcc
rm *.o
make -f Makefile.SSE3.gcc
rm *.o
make -f Makefile.PTHREADS.gcc
rm *.o
make -f Makefile.SSE3.PTHREADS.gcc
# If the CPU supports AVX2, the raxmlHPC-PTHREADS-AVX2 binary is built the same way:
rm *.o
make -f Makefile.AVX2.PTHREADS.gcc

# Once we are done with compiling the code we can execute it locally:
./raxmlHPC

# or copy all the executables into our local directory of binary files:
cp raxmlHPC* ~/bin/

# To get an overview of available commands type:
raxmlHPC -h

Use Flow:

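# Nucleotide alignment: rapid bootstrap (1000 replicates) plus ML search, 8 threads
raxmlHPC-PTHREADS-AVX2 -T 8 -f a -x 123456 -p 123456 -s example.fasta -m GTRGAMMA -N 1000 -n output
# Protein alignment: same analysis with automatic protein model selection
raxmlHPC-PTHREADS-AVX2 -T 8 -f a -x 123456 -p 123456 -s example.fasta -m PROTGAMMAAUTO -N 1000 -n output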

Common error sets:

# Error 1: RAxML_bestTree.alignment: No such file or directory
# Solve 1: Check the RAxML_info.alignment log file to troubleshoot the cause of the error.

# Error 2: Out of Memory
# Solve 2: Reduce the number of threads (e.g., from 16 to 8 threads), use the GTRCAT model (more memory-efficient than GTRGAMMA), and run in stages (first perform ML search, then Bootstrap).

OrthoFinder

Reference: OrthoFinder: phylogenetic orthology inference for comparative genomics

Download: OrthoFinder-main.zip

Installation:

# Create and activate a virtual environment
conda create -n of3_env python=3.12
conda activate of3_env

# Install OrthoFinder (including core dependencies: diamond, mcl, fastme)
conda install -c bioconda -c conda-forge orthofinder

# Verify Installation (Check Version Number)
orthofinder --version

Use Flow:

orthofinder -f ./proteins/ -S diamond -M msa -T raxml -t 8

# -f ./proteins/: Specify input directory (containing FASTA files for proteins across all species);
# -S diamond: Use diamond for fast sequence alignment (default, 10-100 times faster than blastp);
# -M msa: Infer gene trees using multiple sequence alignment (MSA) methods;
# -T raxml: Construct gene trees using RAxML (requires prior installation of RAxML; raxmlHPC-AVX2 version recommended);
# -t 8: Accelerate alignment using 8 threads (adjust based on CPU core count).

Common error sets:

# Error 1: ImportError: No module named numpy or numpy.core.multiarray failed to import
# Solve 1: conda install numpy
export PYTHONPATH="$CONDA_PREFIX/lib/python3.12/site-packages:$PYTHONPATH"

# Error 2: external program called by OrthoFinder returned an error code: 255
# Solve 2: 
# Step 1: Check if RAxML is installed: `which raxmlHPC-AVX2`
# Step 2: Modify OrthoFinder's `config.json` changing the RAxML command to:
# "raxml": {
#   "program_type": "tree",
#   "cmd_line": "raxmlHPC-AVX2 -m PROTGAMMALG -p 12345 -s INPUT -n IDENTIFIER -w PATH > /dev/null",
#   "output_filename": "PATH/RAxML_bestTree.IDENTIFIER"
# }
# Step 3: Manually test the RAxML command.

# Error 3: Input file format not recognized or Sequence length mismatch
# Solve 3: Check file format: `head ./proteins/Human.faa`. If the sequences in an alignment differ in length, realign them with MAFFT or MUSCLE so that all rows have the same length.

KofamScan

Reference: KofamKOALA: KEGG Ortholog assignment based on profile HMM and adaptive score threshold

Download: kofam_scan-1.3.0.tar.gz

Installation:

# Download KofamScan software and database
wget ftp://ftp.genome.jp/pub/tools/kofam_scan/kofam_scan-1.3.0.tar.gz
tar -xzvf kofam_scan-1.3.0.tar.gz

# Download the KEGG Database
wget ftp://ftp.genome.jp/pub/db/kofam/profiles.tar.gz
wget ftp://ftp.genome.jp/pub/db/kofam/ko_list.gz
gunzip ko_list.gz
tar -xzvf profiles.tar.gz

# Configure Environment Variables (Add KofamScan to PATH)
echo 'export PATH=/path/to/kofam_scan-1.3.0/bin:$PATH' >> ~/.bashrc
source ~/.bashrc

Use Flow:

exec_annotation -o output.txt -f mapper -p /path/profiles -k /path/ko_list -E 1e-5 --cpu 6 --tmp-dir tmp_dir input.faa
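
The `mapper` output written by the command above can be summarised with a few lines of Python. This sketch assumes the usual layout of that format: one gene per line, followed by a tab and a KO identifier when one was assigned, with a gene repeated on several lines if it has several KOs. The function name `read_mapper` and the gene and KO identifiers below are placeholders.

```python
def read_mapper(lines):
    """Return {gene_id: [KO, ...]}; genes without an assignment map to an empty list."""
    assignments = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        gene = fields[0].strip()
        if not gene:
            continue
        kos = assignments.setdefault(gene, [])
        if len(fields) > 1 and fields[1].strip():
            kos.append(fields[1].strip())
    return assignments


# Placeholder lines in the assumed mapper layout.
example = ["gene1\tK00001", "gene2", "gene1\tK00002"]
assignments = read_mapper(example)
annotated = sum(1 for kos in assignments.values() if kos)
```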

Common error sets:

# Error 1: No such file or directory: 'hmmsearch' or LoadError: cannot load such file -- parallel
# Solve 1: conda install -c bioconda -c conda-forge hmmer parallel

# Error 2: Profile database not found
# Solve 2: exec_annotation --profile /custom/path/profiles

# Error 3: Invalid input format
# Solve 3: Check file format: `head input.faa` (ensure it starts with `>` and the sequence is in protein alphabet)

trimAl

Reference: trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses

Download: trimAl_Linux_x86-64.zip

Installation:

# Create and activate a conda environment
conda create -n trimal_env -y
conda activate trimal_env

# Install trimAl (Bioconda channel)
conda install -c bioconda trimal

Use Flow:

trimal -in input.pep -out output.out -gt 0.3

Common error sets:

# Error 1: g++: command not found (Linux/macOS)
# Solve 1: sudo apt-get install build-essential  # Ubuntu
# sudo yum groupinstall "Development Tools"      # CentOS

# Error 2: No such file or directory.
# Solve 2: Add the trimAl executable to the PATH.

Circos

Reference: Circos: An information aesthetic for comparative genomics

Download: circos-0.69-10.tgz

Installation:

# Download
wget https://circos.ca/distribution/circos-0.69-10.tgz
tar xzf circos-0.69-10.tgz
cd circos-0.69-10

# Test
cd example/
../bin/circos -conf etc/circos.conf

# Check
circos -modules

Use Flow:

circos -conf circos.conf -outputdir ./output -outputfile circos.png

Common error sets:

# Error 1: Can't locate <Module>.pm in @INC.
# Solve 1: Install the missing Perl module, e.g., for GD: conda install -c bioconda perl-gd

# Error 2: Configuration file error.
# Solve 2: Add karyotype = <path/to/karyotype.txt> to circos.conf.

# Error 3: Invalid data format
# Solve 3: 
# Step 1: Verify that the data file is tab-delimited;
# Step 2: Ensure each line contains four columns (chromosome ID, start position, end position, value)
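
The two checks in Solve 3 can be automated. The following Python sketch applies them literally as written above (tab-delimited, four columns) and additionally checks that start and end are integers in the right order. The function name `check_circos_track` and the example rows are hypothetical.

```python
def check_circos_track(lines):
    """Return (line_number, message) pairs for rows that fail the checks."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 4:
            problems.append(
                (number, f"expected 4 tab-separated columns, found {len(fields)}")
            )
            continue
        _chrom, start, end, _value = fields
        if not (start.isdigit() and end.isdigit()):
            problems.append((number, "start and end must be non-negative integers"))
        elif int(start) > int(end):
            problems.append((number, "start is greater than end"))
    return problems


# Made-up rows: one valid, one space-delimited, one reversed, one non-numeric.
example = [
    "chr1\t0\t1000\t0.5",
    "chr1 1000 2000 0.7",
    "chr2\t500\t100\t1.0",
    "chr2\tx\t100\t1.0",
]
problems = check_circos_track(example)
```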

Pal2NAL

Reference: PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments

Download: pal2nal.v14.tar.gz

Installation:

# Create and activate the virtual environment
conda create -n pal2nal -y -c bioconda -c conda-forge perl perl-bioperl
conda activate pal2nal

# Install Pal2NAL
conda install -c bioconda pal2nal

Use Flow:

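# Codon alignment in FASTA format, with gapped columns removed (codon table 1 is the universal code)
pal2nal.pl pep.aln nuc.fa -output fasta -nogap -codontable 1 > nuc.aln

# Codon alignment in CLUSTAL format
pal2nal.pl pep.aln nuc.fa -output clustal -codontable 1 > nuc.clustal

# Codon alignment in PAML format, suitable as input for codeml
pal2nal.pl pep.aln nuc.fa -output paml -nogap -codontable 1 > nuc.paml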

Common error sets:

# Error 1: Can't locate Bio/SeqIO.pm in @INC.
# Solve 1: conda install -c bioconda perl-bioperl.

# Error 2: ERROR: inconsistency between the following pep and nuc seqs.
# Solve 2: Check file IDs and standardize ID formats using seqkit or a Python script.

# Error 3: The output result is empty.
# Solve 3: 
# Step 1: Verify FASTA format: head nuc.fa (must begin with >, sequence consists of DNA letters).
# Step 2: Ensure each protein corresponds to a unique DNA sequence ID.
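
The ID checks suggested for Error 2 and Error 3 can be scripted. The sketch below compares the record IDs of a protein file and a nucleotide file and reports IDs present in only one of them, plus IDs that occur more than once. It assumes both inputs are in FASTA format (a CLUSTAL-format `pep.aln` would need a different reader); the function names and records are hypothetical.

```python
from collections import Counter


def fasta_ids(lines):
    """Return the first word of every FASTA header line, in file order."""
    ids = []
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            ids.append(parts[0] if parts else "")
    return ids


def compare_ids(pep_ids, nuc_ids):
    """Report IDs unique to one file and IDs duplicated within either file."""
    pep_set, nuc_set = set(pep_ids), set(nuc_ids)
    duplicated = {
        seq_id
        for ids in (pep_ids, nuc_ids)
        for seq_id, count in Counter(ids).items()
        if count > 1
    }
    return {
        "only_in_pep": sorted(pep_set - nuc_set),
        "only_in_nuc": sorted(nuc_set - pep_set),
        "duplicated": sorted(duplicated),
    }


# Made-up records for illustration.
pep = [">g1", "MA", ">g2", "MK"]
nuc = [">g1 some description", "ATGGCC", ">g3", "ATGAAA", ">g3", "ATGAAA"]
report = compare_ids(fasta_ids(pep), fasta_ids(nuc))
```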

gmap

Reference: GMAP: a genomic mapping and alignment program for mRNA and EST sequences

Download: gmap-gsnap-2023-12-01.tar.gz

Installation:

# Download
wget http://research-pub.gene.com/gmap/src/gmap-gsnap-2023-12-01.tar.gz
tar -zxvf gmap-gsnap-2023-12-01.tar.gz
cd gmap-gsnap-2023-12-01

# Configuration and Compilation
./configure --prefix=/path/to/install
make -j4
make install

Use Flow:

gmap_build -D /path/gmapdb -d species genome.fa
gmap -D /path/gmapdb -t 10 -d species -f gff3_gene input.cds >output.gff
gmap -t 10 -d species -f gff3_gene input.fa > output.gff

Common error sets:

# Error 1: Can't locate Bio/SeqIO.pm in @INC.
# Solve 1: conda install -c bioconda perl-bioperl.

# Error 2: chromosome lengths not found.
# Solve 2: samtools faidx reference.fa.

# Error 3: No paths found for <Seq ID>
# Solve 3: Downgrade to a stable version (e.g., 2017-11-15). Verify that the sequence format is FASTA.

GffRead

Reference: GFF Utilities: GffRead and GffCompare

Download: gffread-0.12.7.tar.gz

Installation:

# Download source code
git clone https://github.com/gpertea/gffread
cd gffread

# Compile and install
make release # Generate executable files in the current directory
sudo cp gffread /usr/local/bin/ # Add to system path

Use Flow:

gffread input.gff3 -T -o output.gtf
gffread input.gtf -o output.gff3
gffread input.gff3 -g genome.fa -x cds.fa
gffread input.gff3 -g genome.fa -y protein.fa

Common error sets:

# Error 1: Can't locate Bio/SeqIO.pm in @INC.
# Solve 1: conda install -c bioconda perl-bioperl.

# Error 2: Invalid GFF/GTF format
# Solve 2: Check the file against the GFF3/GTF specification; running `gffread -E input.gff3 -o /dev/null` reports potential problems in the records (see `gffread -h` for all options).

# Error 3: Sequence ID not found in genome file.
# Solve 3: Ensure that the IDs in the genome FASTA files match the seqids in the annotation file.
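
The check in Solve 3 can be done with a short script before running gffread. This Python sketch lists the sequence IDs used in column 1 of the annotation that have no matching record in the genome FASTA. The function name `missing_seqids` and the example rows are hypothetical, and the comparison is case-sensitive on the first word of each FASTA header.

```python
def missing_seqids(gff_lines, fasta_lines):
    """Return annotation seqids that are absent from the genome FASTA headers."""
    fasta_ids = set()
    for line in fasta_lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                fasta_ids.add(parts[0])
    gff_ids = set()
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence section; no more annotation rows follow
        if line.startswith("#") or not line.strip():
            continue
        gff_ids.add(line.split("\t", 1)[0])
    return sorted(gff_ids - fasta_ids)


# Made-up example: the annotation uses "chr2" but the genome names it "Chr2".
gff = [
    "##gff-version 3",
    "Chr1\tdemo\tgene\t1\t100\t.\t+\t.\tID=g1",
    "chr2\tdemo\tgene\t1\t100\t.\t+\t.\tID=g2",
]
genome = [">Chr1 assembled chromosome", "ACGT", ">Chr2", "ACGT"]
missing = missing_seqids(gff, genome)
```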
