python - Motif search with Gibbs sampler -
i beginner in both programming , bioinformatics. so, appreciate understanding. tried develop python script motif search using gibbs sampling explained in coursera class, "finding hidden messages in dna". pseudocode provided in course is:
gibbssampler(dna, k, t, n) randomly select k-mers motifs = (motif1, …, motift) in each string dna bestmotifs ← motifs j ← 1 n ← random(t) profile ← profile matrix constructed strings in motifs except motifi motifi ← profile-randomly generated k-mer in i-th sequence if score(motifs) < score(bestmotifs) bestmotifs ← motifs return bestmotifs
problem description:
code challenge: implement gibbssampler.
input: integers k, t, , n, followed collection of strings dna. output: strings bestmotifs resulting running gibbssampler(dna, k, t, n) 20 random starts. remember use pseudocounts!
sample input:
8 5 100 cgcccctctcgggggtgttcagtaaccggcca gggcgaggtatgtgtaagtgccaaggtgccag tagtaccgagaccgaaagaagtatacaggcgt tagatcaagtttcaggtgcacgtcggtgaacc aatccaccagctccacgtgcaatgttggccta
sample output:
tctcgggg ccaaggtg tacaggcg ttcaggtg tccacgtg
i followed pseudocode best of knowledge. here code:
def buildprofilematrix(dnamatrix): profilematrix = [[1 x in xrange(len(dnamatrix[0]))] x in xrange(4)] indices = {'a':0, 'c':1, 'g': 2, 't':3} seq in dnamatrix: in xrange(len(dnamatrix[0])): profilematrix[indices[seq[i]]][i] += 1 probmatrix = [[float(x)/sum(zip(*profilematrix)[0]) x in y] y in profilematrix] return probmatrix def profilerandomgenerator(profile, dna, k, i): indices = {'a':0, 'c':1, 'g': 2, 't':3} score_list = [] x in xrange(len(dna[i]) - k + 1): probability = 1 window = dna[i][x : k + x] y in xrange(k): probability *= profile[indices[window[y]]][y] score_list.append(probability) rnd = uniform(0, sum(score_list)) current = 0 z, bias in enumerate(score_list): current += bias if rnd <= current: return dna[i][z : k + z] def score(motifs): profilematrix = [[0 x in xrange(len(motifs[0]))] x in xrange(4)] indices = {'a':0, 'c':1, 'g': 2, 't':3} seq in motifs: in xrange(len(motifs[0])): profilematrix[indices[seq[i]]][i] += 1 score = len(motifs)*len(motifs[0]) - sum([max(x) x in zip(*profilematrix)]) return score random import randint, uniform def gibbssampler(k, t, n): dna = ['cgcccctctcgggggtgttcagtaaccggcca', 'gggcgaggtatgtgtaagtgccaaggtgccag', 'tagtaccgagaccgaaagaagtatacaggcgt', 'tagatcaagtttcaggtgcacgtcggtgaacc', 'aatccaccagctccacgtgcaatgttggccta'] motifs = [] in [randint(0, len(dna[0])-k) x in range(len(dna))]: j = 0 kmer = dna[j][i : k+i] j += 1 motifs.append(kmer) bestmotifs = [] s_best = float('inf') in xrange(n): x = randint(0, t-1) motifs.pop(x) profile = buildprofilematrix(motifs) motif = profilerandomgenerator(profile, dna, k, x) motifs.append(motif) s_motifs = score(motifs) if s_motifs < s_best: s_best = s_motifs bestmotifs = motifs return [s_best, bestmotifs] k, t, n =8, 5, 100 best_motifs = [float('inf'), none] # repeat gibbs sampler search 20 times. repeat in xrange(20): current_motifs = gibbssampler(k, t, n) if current_motifs[0] < best_motifs[0]: best_motifs = current_motifs # print , save answer. print '\n'.join(best_motifs[1])
unfortunately, code never gives same output solved example. besides, while trying debug code found weird scores define mismatches between motifs. however, when tried run score function separately, worked perfectly.
each time run script, output changes, anyway here example of 1 of outputs input present in code:
example output of code
tatgtgta tatgtgta tatgtgta ggtgttca tatacagg
could please me debug code?!! spent whole day trying find out what's wrong although know might silly mistake made, eye failed catch it.
thank all!!
finally, found out wrong in code! in line 54:
motifs.append(motif)
after randomly removing 1 of motifs, followed building profile out of these motifs randomly selecting new motif based on profile, should have added selected motif in same position before removal not appended end of motif list.
now, correct code is:
motifs.insert(x, motif)
the new code worked expected.
Comments
Post a Comment