Matthew Garrett ([personal profile] mjg59) wrote2011-10-07 09:37 am

Margaret Dayhoff

It's become kind of a cliché for me to claim that the reason I'm happy working on ACPI and UEFI and similarly arcane pieces of convoluted functionality is that no matter how bad things are there's at least some form of documentation and there's a well-understood language at the heart of them. My PhD was in biology, working on fruitflies. They're a poorly documented set of layering violations which only work because of side-effects at the quantum level, and they tend to die at inconvenient times. They're made up of 165 million bases of a byte code language that's almost impossible to bootstrap[1] and which passes through an intermediate representations before it does anything useful[2]. It's an awful field to try to do rigorous work in because your attempts to impose any kind of meaningful order on what you're looking at are pretty much guaranteed to be sufficiently naive that your results bear a resemblance to reality more by accident than design.

The field of bioinformatics is a fairly young one, and because of that it's very easy to be ignorant of its history. Crick and Watson (and those other people) determined the structure of DNA. Sanger worked out how to sequence proteins and nucleic acids. Some other people made all of these things faster and better and now we have huge sequence databases that mean we can get hold of an intractable quantity of data faster than we could ever plausibly need to, and what else is there to know?

Margaret Dayhoff graduated with a PhD in quantum chemistry from Columbia, where she'd performed computational analysis of various molecules to calculate their resonance energies[3]. The next few years involved plenty of worthwhile research that aren't relevant to the story, so we'll (entirely unfairly) skip forward to the early 60s and the problem of turning a set of sequence fragments into a single sequence. Dayhoff worked on a suite of applications called "Comprotein". The original paper can be downloaded here, and it's a charming look back at a rigorous analysis of a problem that anyone in the field would take for granted these days. Modern fragment assembly involves taking millions of DNA sequence reads and assembling them into an entire genome. In 1960, we were still at the point where it was only just getting impractical to do everything by hand.

This single piece of software was arguably the birth of modern bioinformatics, the creation of a computational method for taking sequence data and turning it into something more useful. But Dayhoff didn't stop there. The 60s brought a growing realisation that small sequence differences between the same protein in related species could give insight into their evolutionary past. In 1965 Dayhoff released the first edition of the Atlas of Protein Sequence and Structure, containing all 65 protein sequences that had been determined by then. Around the same time she developed computational methods for analysing the evolutionary relationship of these sequences, helping produce the first computationally generated phylogenetic tree. Her single-letter representation of amino acids was born of necessity[4] but remains the standard for protein sequences. And the atlas of 65 protein sequences developed into the Protein Information Resource, a dial-up database that allowed researchers to download the sequences they were interested in. It's now part of UniProt, the world's largest protein database.

Her contributions to the field were immense. Every aspect of her work on bioinformatics is present in the modern day — larger, faster and more capable, but still very much tied to the techniques and concepts she pioneered. And so it still puzzles me that I only heard of her for the first time when I went back to write the introduction to my thesis. She's remembered today in the form of the Margaret Oakley Dayhoff award for women showing high promise in biophysics, having died of a heart attack at only 57.

I don't work on fruitflies any more, and to be honest I'm not terribly upset by that. But it's still somewhat disconcerting that I spent almost 10 years working in a field so defined by one person that I knew so little about. So my contribution to Ada Lovelace day is to highlight a pivotal woman in science who heavily influenced my life without me even knowing.

[1] You think it's difficult bringing up a compiler on a new architecture? Try bringing up a fruitfly from scratch.
[2] Except for the cases where the low-level language itself is functionally significant, and the cases where the intermediate representation is functionally significant.
[3] Something that seems to have involved a lot of putting punch cards through a set of machines, getting new cards out, and repeating. I'm glad I live in the future.
[4] The three-letter representation took up too much space on punch cards
brainwane: My smiling face, including a small gold bindi (Default)

[personal profile] brainwane 2011-10-07 05:52 pm (UTC)(link)
having died of a heart attack at only 57

Oh that's a punch in the gut. We missed out on so much.

Thank you for telling me about Dr. Dayhoff. I'm so glad she got to contribute so much.

PAM

(Anonymous) 2011-10-11 07:02 am (UTC)(link)
Well, nowadays bioinformatics students usually come across Dayhoff when PAM matrices (http://en.wikipedia.org/wiki/Substitution_matrix#PAM) are introduced in a first sequence comparison lecture. Probably in the 2nd or 3rd semester.

Mycoplasma laboratorium

(Anonymous) 2012-02-03 06:09 am (UTC)(link)
Bring up a fruitfly from scratch? We can't even bring up a reduced instruction set bacterium from scratch yet!

Re: Mycoplasma laboratorium

(Anonymous) 2012-02-03 06:26 am (UTC)(link)
Sorry, hit "post" to soon. I meant to say, "bring up a RISC bacterium without cheating". The code is 99% stolen, but it's entirely hand-assembled instead of being hand-linked natural modules. Nevertheless a major milestone.

Even a protein compiler, which would take a description of a tertiary structure (loops, beta-sheets, etc.) and generate amino acid or DNA sequences is still science fiction as far as I know. But bioinformatics is moving fast, I could be wrong.