A promoter is a region of DNA that affects how strongly a given gene is expressed. In synthetic biology, promoters are a very common design element in genetic circuits. It is still not entirely understood how the DNA sequence of a promoter determines its strength. In 2012, the Segal lab published a data set containing the raw expression data of 6,500 mutated promoter sequences. I was interested in seeing whether I could use the data to model the strength of a promoter from its sequence using a convolutional neural network (CNN), and whether I could then use the model to design 'super-promoters'.
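To feed a sequence to a CNN it first has to be turned into numbers. A common choice, and only an assumption about what was done here, is one-hot encoding, where each base becomes a row with a single 1. A minimal sketch (the helper name `one_hot` is made up for illustration):

```python
# Hypothetical helper: one-hot encode a DNA sequence over the bases A, C, G, T.
BASES = "ACGT"

def one_hot(seq):
    """Return a len(seq) x 4 matrix; unknown bases (e.g. N) become all-zero rows."""
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]

# Each printed row marks which of A, C, G, T is present at that position.
example = "TATAAN"
for base, row in zip(example, one_hot(example)):
    print(base, row)
```

A 1D convolution over a matrix like this acts as a motif scanner, which is why CNNs are a natural fit for sequence data.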
I was able to design a CNN that fits the data pretty nicely. After around 30 generations of random mutation and simple gradient ascent, the search converges on a set of promoter sequences that the model predicts to be twice as strong as the strongest promoter in the original data set.
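The search loop can be sketched roughly as follows. This is a simplified mutate-and-select hill climb, not necessarily the exact procedure used here, and `toy_score` is a made-up stand-in for the trained CNN's prediction; the population sizes are arbitrary assumptions too.

```python
import random

BASES = "ACGT"

def toy_score(seq):
    # Stand-in for the CNN's predicted strength (NOT the real model):
    # counts possibly overlapping occurrences of the hypothetical motif "TATA".
    return sum(seq[i:i + 4] == "TATA" for i in range(len(seq) - 3))

def mutate(seq, rng):
    # Replace one randomly chosen position with a different base.
    i = rng.randrange(len(seq))
    new_base = rng.choice([b for b in BASES if b != seq[i]])
    return seq[:i] + new_base + seq[i + 1:]

def evolve(population, score, generations=30, mutants_per_seq=20, keep=5, seed=0):
    rng = random.Random(seed)
    for _ in range(generations):
        # Keep the parents as candidates so the best score never decreases.
        candidates = set(population)
        for seq in population:
            for _ in range(mutants_per_seq):
                candidates.add(mutate(seq, rng))
        # Rank by predicted strength; break ties alphabetically for determinism.
        population = sorted(candidates, key=lambda s: (-score(s), s))[:keep]
    return population

start = ["ACGTACGTACGTACGTACGT"]
best = evolve(start, toy_score)
print(best[0], toy_score(best[0]))
```

With a real model, `score` would wrap the CNN's prediction for an encoded sequence. Because the parents stay in the candidate pool, the top predicted score can only stay the same or rise from one generation to the next.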
It would be interesting to build some of those 'super-promoters' and measure the result. The model may have learned some motifs that increase expression level, but it might still be missing other motifs that reduce expression. Iteratively retraining the model on measurements from some of these new, predicted designs could expose the missing motifs.