Download Latest Version needleman_wunsch-0.3.5.tar.gz (39.3 kB)
Email in envelope

Get an email when there's a new version of NeedlemanWunsch

Home
Name Modified Size InfoDownloads / Week
needleman_wunsch-0.3.5.tar.gz 2013-01-27 39.3 kB
needleman_wunsch-0.3.4.tar.gz 2012-01-22 37.8 kB
README 2012-01-22 6.5 kB
needleman_wunsch-0.3.3.tar.gz 2011-12-06 31.2 kB
needleman_wunsch-0.2.3.tar.gz 2011-11-27 11.4 kB
Totals: 5 Items   126.1 kB 21
Needleman-Wunsch global alignment
http://sourceforge.net/projects/needlemanwunsch
Isaac Turner <turner.isaac@gmail.com>
6 Dec 2011

GNU General Public License (v3 or later).  See LICENSE file.

== Build ==

isaac$ make

== Run ==

Baiscs:

isaac$ ./needleman_wunsch CAGACGT CGATA
CAGACGT
C--GATA

Print alignment scores:

isaac$ ./needleman_wunsch --printscores CAGACGT CGATA
CAGACGT
C--GATA
score: -15

Read from file:

--- dna.fa file --
>seqA
ACAATAGAC
>seqB
ACGAATAGAT
>seqC
ACGTGA
CAGAT
>seqD
GTGGACG
AGTA
-----------------

isaac$ ./needleman_wunsch --printscores --file dna.fa.gz
AC-AATAGAC
ACGAATAGAT
score: -3

ACGTGACAGAT
GTGGACGAGTA
score: -5


Reading from STDIN:

isaac$ gzip -d -c dna.fa.gz | ./needleman_wunsch --stdin
AC-AATAGAC
ACGAATAGAT

ACGTGACAGAT
GTGGACGAGTA

is the same as:
isaac$ gzip -d -c dna.fa.gz | ./needleman_wunsch --file -

Set different scoring systems:

isaac$ ./needleman_wunsch --match 1 --mismatch 0 --gapopen -10 --gapextend 0 ACGTGCCCCACAGAT AGGTGGACGAGAT

== Getting help ==

Feel free to contact me to request features.  Bug reports are appreciated.  

isaac$ ./needleman_wunsch
usage: ./needleman_wunsch [OPTIONS] [seq1 seq2]
  Needleman-Wunsch optimal global alignment (maximises score).  
  Takes a pair of sequences on the command line, reads from a
  file and from sequence piped in.  Can read gzip files and FASTA.  

  OPTIONS:
    --file <file>        Sequence file reading with gzip support
    --stdin              Read from STDIN (same as '--file -')
    --case_sensitive     Case sensitive character comparison

    --scoring <PAM30|PAM70|BLOSUM80|BLOSUM62>
    --substitution_matrix <file>  see details for formatting
    --substitution_pairs <file>   see details for formatting

    --match <score>      default: 1
    --mismatch <score>   default: -2
    --gapopen <score>    default: -4
    --gapextend <score>  default: -1

    --freestartgap       No penalty for gap at start of alignment
    --freeendgap         No penalty for gap at end of alignment

    --printscores        Print optimal alignment scores
    --printfasta         Print fasta header lines
    --pretty             Print with a descriptor line
    --colour             Print with in colour
    --zam                A funky type of output

 DETAILS:
  * For help choosing scoring, see the README file. 
  * Gap (of length N) penalty is: (open+N*extend)
  * To do alignment without affine gap, set '--gapopen 0'.
  * Scoring files should be matrices, with entries separated
    by a single character or whitespace.  See files in the
    'scores' directory for examples.

  turner.isaac@gmail.com  06 Dec 2011

== Scoring systems ==

Proteins:
Query Length    Substitution Matrix   Gap Costs (gap_open,gap_extend)
<35             PAM-30                (9,1)
35-50           PAM-70                (10,1)
50-85           BLOSUM-80             (10,1)
85              BLOSUM-62             (10,1)
[table from http://www.ncbi.nlm.nih.gov/blast/html/sub_matrix.html]

gap (of length N) penalty: gap_open + N*gap_extend

NCBI BLAST Quote:
Many nucleotide searches use a simple scoring system that consists of a "reward"
for a match and a "penalty" for a mismatch. The (absolute) reward/penalty ratio
should be increased as one looks at more divergent sequences. A ratio of 0.33
(1/-3) is appropriate for sequences that are about 99% conserved; a ratio of 0.5
(1/-2) is best for sequences that are 95% conserved; a ratio of about one (1/-1)
is best for sequences that are 75% conserved [1].
[from: http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml#Reward-penalty]

NCBI Gap (open, extend) values:
-5, -2
-2, -2
-1, -2
-0, -2
-3, -1
-2, -1
-1, -1

Our default (for now) are:
gap_open/gap_extend: (-4,-1)
match/mismatch: (1,-2)

== Use Needleman-Wunsch alignment in your own code ==

In order to use needleman_wunsch with your own c/c++ programs, you need the
following files:

needleman_wunsch.c
needleman_wunsch.h
alignment_scoring.c
alignment_scoring.h
alignment.c
alignment.h

these are all in the release, so added -I <path_to_dir> to gcc will work and the
paths to the .c files.  Like so:

gcc -I <path_to_dir> <path_to_dir>/needleman_wunsch.c <path_to_dir>/alignment.c
<path_to_dir>/alignment_scoring.c <yourfiles>

Use the code below and include needleman_wunsch.c in your compile command.
You also need a copy of uthash.h in your path.  It's a single file downloadable
from here: http://uthash.sourceforge.net ...and it's great!

-----------------------------

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include "needleman_wunsch.h"

void align(char* seq_a, char* seq_b)
{
  // Variables to store alignment result
  char *alignment_a, *alignment_b;

  // malloc the above variables
  // (seq1 and seq2 are used to figure out how much memory may be needed)
  nw_alloc_mem(seq_a, seq_b, &alignment_a, &alignment_b);

  // Decide on scoring
  int match = 1;
  int mismatch = -2;
  int gap_open = -4;
  int gap_extend = -1;
  
  // Don't penalise gaps at the start
  // ACGATTT
  // ----TTT would score +3 (when match=+1)
  char no_start_gap_penalty = 1;
  
  // ..or gaps at the end e.g.
  // ACGATTT
  // ACGA--- would score +4 (when match=+1)
  char no_end_gap_penalty = 1;

  // Compare character case-sensitively (usually set to 0 for DNA etc)
  char case_sensitive = 0;

  SCORING_SYSTEM* scoring = scoring_create(match, mismatch,
                                           gap_open, gap_extend,
                                           no_start_gap_penalty,
                                           no_end_gap_penalty,
                                           case_sensitive);

  // Add some special cases
  // x -> y means x in seq1 changing to y in seq2
  scoring_add_mutation(scoring, 'a', 'c', -2); // a -> c give substitution score -2
  scoring_add_mutation(scoring, 'c', 'a', -1); // c -> a give substitution score -1

  // We could also prohibit the aligning of characters not given as special cases
  // scoring->use_match_mismatch = 0;

  int score = needleman_wunsch(seq_a, seq_b, alignment_a, alignment_b, scoring);

  printf("seqA: %s\n", alignment_a);
  printf("seqB: %s\n", alignment_b);
  printf("alignment score: %i\n", score);

  // Free memory used to store scoring preferences
  scoring_free(scoring);

  free(alignment_a);
  free(alignment_b);
}

# Compiled and tested 22 Jan 2012
-----------------------------

== Development ==

Short term goals:
- none - suggest some!

Long term goals:
- allow inversions
Source: README, updated 2012-01-22