Needleman-Wunsch global alignment
http://sourceforge.net/projects/needlemanwunsch
Isaac Turner <turner.isaac@gmail.com>
6 Dec 2011
GNU General Public License (v3 or later). See LICENSE file.
== Build ==
isaac$ make
== Run ==
Baiscs:
isaac$ ./needleman_wunsch CAGACGT CGATA
CAGACGT
C--GATA
Print alignment scores:
isaac$ ./needleman_wunsch --printscores CAGACGT CGATA
CAGACGT
C--GATA
score: -15
Read from file:
--- dna.fa file --
>seqA
ACAATAGAC
>seqB
ACGAATAGAT
>seqC
ACGTGA
CAGAT
>seqD
GTGGACG
AGTA
-----------------
isaac$ ./needleman_wunsch --printscores --file dna.fa.gz
AC-AATAGAC
ACGAATAGAT
score: -3
ACGTGACAGAT
GTGGACGAGTA
score: -5
Reading from STDIN:
isaac$ gzip -d -c dna.fa.gz | ./needleman_wunsch --stdin
AC-AATAGAC
ACGAATAGAT
ACGTGACAGAT
GTGGACGAGTA
is the same as:
isaac$ gzip -d -c dna.fa.gz | ./needleman_wunsch --file -
Set different scoring systems:
isaac$ ./needleman_wunsch --match 1 --mismatch 0 --gapopen -10 --gapextend 0 ACGTGCCCCACAGAT AGGTGGACGAGAT
== Getting help ==
Feel free to contact me to request features. Bug reports are appreciated.
isaac$ ./needleman_wunsch
usage: ./needleman_wunsch [OPTIONS] [seq1 seq2]
Needleman-Wunsch optimal global alignment (maximises score).
Takes a pair of sequences on the command line, reads from a
file and from sequence piped in. Can read gzip files and FASTA.
OPTIONS:
--file <file> Sequence file reading with gzip support
--stdin Read from STDIN (same as '--file -')
--case_sensitive Case sensitive character comparison
--scoring <PAM30|PAM70|BLOSUM80|BLOSUM62>
--substitution_matrix <file> see details for formatting
--substitution_pairs <file> see details for formatting
--match <score> default: 1
--mismatch <score> default: -2
--gapopen <score> default: -4
--gapextend <score> default: -1
--freestartgap No penalty for gap at start of alignment
--freeendgap No penalty for gap at end of alignment
--printscores Print optimal alignment scores
--printfasta Print fasta header lines
--pretty Print with a descriptor line
--colour Print with in colour
--zam A funky type of output
DETAILS:
* For help choosing scoring, see the README file.
* Gap (of length N) penalty is: (open+N*extend)
* To do alignment without affine gap, set '--gapopen 0'.
* Scoring files should be matrices, with entries separated
by a single character or whitespace. See files in the
'scores' directory for examples.
turner.isaac@gmail.com 06 Dec 2011
== Scoring systems ==
Proteins:
Query Length Substitution Matrix Gap Costs (gap_open,gap_extend)
<35 PAM-30 (9,1)
35-50 PAM-70 (10,1)
50-85 BLOSUM-80 (10,1)
85 BLOSUM-62 (10,1)
[table from http://www.ncbi.nlm.nih.gov/blast/html/sub_matrix.html]
gap (of length N) penalty: gap_open + N*gap_extend
NCBI BLAST Quote:
Many nucleotide searches use a simple scoring system that consists of a "reward"
for a match and a "penalty" for a mismatch. The (absolute) reward/penalty ratio
should be increased as one looks at more divergent sequences. A ratio of 0.33
(1/-3) is appropriate for sequences that are about 99% conserved; a ratio of 0.5
(1/-2) is best for sequences that are 95% conserved; a ratio of about one (1/-1)
is best for sequences that are 75% conserved [1].
[from: http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml#Reward-penalty]
NCBI Gap (open, extend) values:
-5, -2
-2, -2
-1, -2
-0, -2
-3, -1
-2, -1
-1, -1
Our default (for now) are:
gap_open/gap_extend: (-4,-1)
match/mismatch: (1,-2)
== Use Needleman-Wunsch alignment in your own code ==
In order to use needleman_wunsch with your own c/c++ programs, you need the
following files:
needleman_wunsch.c
needleman_wunsch.h
alignment_scoring.c
alignment_scoring.h
alignment.c
alignment.h
these are all in the release, so added -I <path_to_dir> to gcc will work and the
paths to the .c files. Like so:
gcc -I <path_to_dir> <path_to_dir>/needleman_wunsch.c <path_to_dir>/alignment.c
<path_to_dir>/alignment_scoring.c <yourfiles>
Use the code below and include needleman_wunsch.c in your compile command.
You also need a copy of uthash.h in your path. It's a single file downloadable
from here: http://uthash.sourceforge.net ...and it's great!
-----------------------------
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include "needleman_wunsch.h"
void align(char* seq_a, char* seq_b)
{
// Variables to store alignment result
char *alignment_a, *alignment_b;
// malloc the above variables
// (seq1 and seq2 are used to figure out how much memory may be needed)
nw_alloc_mem(seq_a, seq_b, &alignment_a, &alignment_b);
// Decide on scoring
int match = 1;
int mismatch = -2;
int gap_open = -4;
int gap_extend = -1;
// Don't penalise gaps at the start
// ACGATTT
// ----TTT would score +3 (when match=+1)
char no_start_gap_penalty = 1;
// ..or gaps at the end e.g.
// ACGATTT
// ACGA--- would score +4 (when match=+1)
char no_end_gap_penalty = 1;
// Compare character case-sensitively (usually set to 0 for DNA etc)
char case_sensitive = 0;
SCORING_SYSTEM* scoring = scoring_create(match, mismatch,
gap_open, gap_extend,
no_start_gap_penalty,
no_end_gap_penalty,
case_sensitive);
// Add some special cases
// x -> y means x in seq1 changing to y in seq2
scoring_add_mutation(scoring, 'a', 'c', -2); // a -> c give substitution score -2
scoring_add_mutation(scoring, 'c', 'a', -1); // c -> a give substitution score -1
// We could also prohibit the aligning of characters not given as special cases
// scoring->use_match_mismatch = 0;
int score = needleman_wunsch(seq_a, seq_b, alignment_a, alignment_b, scoring);
printf("seqA: %s\n", alignment_a);
printf("seqB: %s\n", alignment_b);
printf("alignment score: %i\n", score);
// Free memory used to store scoring preferences
scoring_free(scoring);
free(alignment_a);
free(alignment_b);
}
# Compiled and tested 22 Jan 2012
-----------------------------
== Development ==
Short term goals:
- none - suggest some!
Long term goals:
- allow inversions