Bioinformatics in Molecular Anthropology
Todd Disotell and Anthony Di Fiore
Researchers at NYU’s Molecular Anthropology Laboratory, part of the Department of Anthropology (Faculty of Arts and Science), use a combination of molecular genetic, bioinformatic, and computational technologies to study a wide array of issues relating to the evolution and behavior of human and non-human primates.
Projects at the Lab headed by Todd Disotell, Clifford Jolly, and Anthony Di Fiore include studies of primate cooperative behavior and mating systems; inference of evolutionary relationships among primate species; investigation of interspecific and intergeneric hybridization; reconstruction of human population histories; inquiries into conservation genetics of threatened and endangered primate species; studies of evolution of the primate immune system; research into the coevolution of the nuclear and mitochondrial genomes; and even identification of feeding sites.
Bioinformatic and computational technologies are key to all of these studies. Genetic databases such as GenBank are used as sources of comparative DNA sequences and to identify and develop new genetic markers. A wide variety of software is employed in analyzing the large volumes of genetic data generated by researchers in the Laboratory. Sophisticated programs are needed to collect, assemble, and align DNA sequences; infer evolutionary relationships among species; estimate the kin relationships among individuals within populations; identify likely interspecific hybrids; look for genetic structure within populations; and search for evidence of natural selection in genes and coevolution among sets of genes and genomes. Currently, Internet connectivity to the various international databases and access to high-performance computing resources are as important as laboratory bench space and instrumentation. In fact, one of the most important pieces of equipment in the Laboratory is a rack-mounted Apple Xserve cluster of 16 dual-processor 2.0 GHz Xserves with a three-terabyte disk array, which Dr. Disotell worked with ITS to acquire.
African Old World monkeys (L-R): olive baboon, blue monkey, and black and white colobus.
Recent Discoveries at NYU
Over the last several years, new subspecies of gorillas and chimpanzees have been proposed, based upon molecular analyses carried out in conjunction with NYU’s Molecular Anthropology Laboratory. More recent projects using bioinformatic approaches have revealed a number of interesting patterns in primate evolutionary history. Within the guenons, a group of colorful small monkeys that live throughout Africa, an extensive molecular survey of almost all the guenon species has demonstrated a close evolutionary relationship among all terrestrial species, to the exclusion of all arboreal species. This was rather unexpected, since moving down from, or up into, the trees was thought to be a relatively easy evolutionary transition. Within the same group of African monkeys, we discovered that a “dwarf” species, which maintained many primitive morphological traits, was actually more closely related to a very different group of guenons. Among the groups of baboons, which vary widely in size, shape, coat color, and even behavior, we have discovered extensive hybridization.
Molecular Anthropology in vitro
Genomes currently mapped or in the process of being sequenced.
Most molecular studies carried out in the Laboratory begin with either the characterization of the DNA sequence or the determination of the alleles present in an individual primate. Sources of DNA can be blood, tissue, saliva, bones, teeth, hair, and feces, depending upon the project and the availability of primate biomaterials. Once DNA is extracted, the polymerase chain reaction (PCR) is carried out to produce millions of copies of the region of interest. The amplified products can then be characterized as to the presence or absence of a particular allele, or the sizes of the alleles can be determined via capillary electrophoresis. Capillary electrophoresis involves applying an electric field across a gel-like medium inside a capillary, which causes the DNA to migrate at a rate that depends on the length of the amplified molecule. Fluorescent dyes attached to the DNA molecules are excited by a laser, and a CCD camera captures their intensity as they migrate past it in the capillary. Computer algorithms then calculate the size and quantity of the migrating DNA molecules, and these data are transformed into individual allele sizes. This information can be used to determine the presence of a particular sequence, such as a SINE (short interspersed nuclear element) or an endogenous retrovirus, at a specific spot in the genome. The pattern of allele sizes at multiple microsatellite loci (specific spots in the genome) can be used to determine relatedness between individuals or even to identify individuals, as in forensic contexts.
After sequencing reactions are carried out, similar techniques of capillary electrophoresis can determine the precise sequence of a region of interest. Multiple overlapping sequences can be stitched together into longer regions, even up to the length of the whole genome of an organism. The Laboratory regularly sequences the entire 16,500 base pair mitochondrial genomes of various primate species. All of these data need to be stored, organized, and made easily searchable and comparable.
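The idea of stitching overlapping sequences into a longer region can be illustrated with a minimal Python sketch. It assumes error-free reads and exact suffix-to-prefix matches, which real assembly software does not; the function name `merge_reads` and the `min_overlap` parameter are hypothetical and not taken from any tool used in the Laboratory.

```python
def merge_reads(left, right, min_overlap=3):
    """Join two reads where the end of `left` exactly matches the start of `right`.

    Returns the merged sequence, or None if no overlap of at least
    `min_overlap` bases exists. Assumes reads contain no sequencing errors.
    """
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    longest_possible = min(len(left), len(right))
    # Try the longest candidate overlap first so the merge is as compact as possible.
    for k in range(longest_possible, min_overlap - 1, -1):
        if left[-k:] == right[:k]:
            return left + right[k:]
    return None


if __name__ == "__main__":
    # Two made-up reads sharing the four bases TGCA.
    print(merge_reads("ACGTTGCA", "TGCAGGTC"))
```

Repeating such a merge across many reads is, in spirit, how longer regions are built up, although production assemblers must also handle sequencing errors, repeats, and reverse-complemented reads.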
Hundreds of hours of laboratory work and tens of thousands of dollars in supplies and salaries are saved by applying bioinformatic approaches to the design of experiments and to the search for informative regions of the genome, before the first reagent is expended. Basic genomic search tools such as BLAST (Basic Local Alignment Search Tool) are used numerous times daily to find similar sequences in an organism’s genome or in other organisms. The search for microsatellites (highly variable stretches of DNA that can be used to infer relatedness between populations and among individuals within a population) can take advantage of the multiple primate genome sequences now in the databases. A microsatellite is a short segment of DNA composed of a variable number of tandem repeats of a unit that is usually between two and five bases long. For example, the sequences ATATATATAT and CGCCGCCGCCGCCGCCGC consist of five and six tandem repeats of the units AT and CGC, respectively. Such sequence patterns, or motifs, can be quickly scanned for by various algorithms to find every location where they occur in a genomic sequence. Unfortunately, many of the primate species studied at NYU are poorly characterized genetically.
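A motif scan of this kind can be sketched in a few lines of Python using a regular expression with a back-reference. This is only an illustration, not the program used in the Laboratory; the function name `find_microsatellites` and the default threshold of four repeats are assumptions made for the example.

```python
import re


def find_microsatellites(seq, min_unit=2, max_unit=5, min_repeats=4):
    """Return (start, unit, repeat_count) for each perfect tandem repeat in seq."""
    seq = seq.upper()
    # Group 1 lazily captures the shortest candidate unit of 2 to 5 bases;
    # the back-reference \1 then demands at least (min_repeats - 1) more
    # copies of that same unit immediately afterwards.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq):
        unit = match.group(1)
        if len(set(unit)) == 1:
            continue  # a run of one base (e.g. AAAAAAAA) is not a 2-5 base repeat
        hits.append((match.start(), unit, len(match.group(0)) // len(unit)))
    return hits


if __name__ == "__main__":
    # The two example sequences from the text, embedded in made-up flanking bases.
    print(find_microsatellites("TTGATATATATATGGACGCCGCCGCCGCCGCCGCTTA"))
```

The sketch only finds perfect repeats; interrupted or imperfect microsatellites would need a more tolerant search.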
Molecular Anthropology in silico
The rapidly expanding public databases of DNA sequences now allow us to use a bioinformatic approach to detect microsatellites in species that are closely related to those that need to be characterized. To do so, we use freely available bioinformatic software modules to automatically scan the databases for new informative loci as sequences are deposited in GenBank. First, all new primate DNA sequences are downloaded to a local server at regular intervals. Each new sequence is then scanned for the repeat motifs that characterize microsatellite regions, using a program from EMBOSS (European Molecular Biology Open Software Suite). Then, potentially informative loci are fed into a second application that designs molecular probes to detect and characterize the markers in the laboratory. These molecular probes can then be verified at the bench with real samples. We are currently automating this pipeline using Perl scripts.
| Number of Species | Possible Trees |
| --- | --- |
| 2 | 1 |
| 3 | 3 |
| 4 | 15 |
| 5 | 105 |
| 6 | 945 |
| 7 | 10,395 |
| 8 | 135,135 |
| 9 | 2,027,025 |
| 10 | 34,459,425 |
| 50 | $3 \times 10^{76}$ |
| 100 | $3 \times 10^{184}$ |
In phylogenetic analyses, the number of rooted evolutionary trees that need to be examined is determined by $N_R = (2n-3)!/[2^{n-2}(n-2)!]$, where $n$ is the number of species.
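The growth of this number can be checked by evaluating the formula directly. The short Python sketch below does so with exact integer arithmetic; the function name `rooted_tree_count` is chosen here for illustration only.

```python
from math import factorial


def rooted_tree_count(n):
    """Number of rooted bifurcating trees for n species: (2n-3)! / (2^(n-2) (n-2)!)."""
    if n < 2:
        raise ValueError("the formula is defined for n >= 2 species")
    # The division is always exact, so integer division loses nothing.
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


if __name__ == "__main__":
    for n in (2, 3, 4, 5, 10, 50, 100):
        count = rooted_tree_count(n)
        # Use scientific notation for very large counts, commas otherwise.
        text = f"{count:.2e}" if count >= 10**12 else f"{count:,}"
        print(f"{n:>3} species: {text} rooted trees")
```

Even at 50 species the count far exceeds anything that could be enumerated exhaustively, which is why tree searches rely on heuristics.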
Variable microsatellite loci can then be used to characterize the population variation and structure in order to better inform decisions regarding conservation priorities and degree of effort. In a project in conjunction with researchers from Columbia University, we are using such markers to estimate kin relationships among Kenyan blue monkeys to determine whether the number of kin within a group influences the probability of territorial defense. With researchers from the State University of New York at Stony Brook, we are using similar markers to determine if female kin sometimes migrate together. Alternatively, are they more likely to join groups where kin have already entered, or do females try to avoid their kin when joining new groups? One student, in collaboration with colleagues in Nigeria, is carrying out a census of a shy and rare population of gorillas by repeatedly sampling the DNA in feces left behind in night nests. Individual DNA fingerprints are gathered until no new individuals are found, unobtrusively yielding a complete portrait of the population. Other variable markers are being used to try to determine from which tree in a large territory a monkey has eaten. This is done by sampling the DNA in seeds extracted from the monkey’s feces and comparing it to all of the trees in its feeding territory which have individual DNA signatures.
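The logic of a fecal DNA census, gathering fingerprints until no new individuals appear, can be sketched as a discovery curve. The Python example below assumes that identical multilocus genotypes come from the same animal and that genotypes are scored without error, which is optimistic for fecal DNA; the function name `discovery_curve` and the allele sizes are invented for illustration.

```python
def discovery_curve(samples):
    """Return the cumulative number of distinct genotypes after each sample.

    Each sample is a sequence of loci, and each locus is a pair of allele sizes.
    """
    seen = set()
    curve = []
    for genotype in samples:
        # Sort the two alleles at each locus so (152, 148) and (148, 152) match.
        key = tuple(tuple(sorted(locus)) for locus in genotype)
        seen.add(key)
        curve.append(len(seen))
    return curve


if __name__ == "__main__":
    # Hypothetical allele sizes (in base pairs) at two microsatellite loci.
    nest_samples = [
        [(148, 152), (201, 201)],
        [(150, 150), (199, 203)],
        [(152, 148), (201, 201)],  # same individual as the first sample
        [(148, 150), (203, 205)],
        [(150, 150), (203, 199)],  # same individual as the second sample
    ]
    print(discovery_curve(nest_samples))
```

When the curve stops rising as more nests are sampled, most individuals have probably been detected. A real analysis would also need to allow for genotyping errors rather than require exact matches.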
Another class of bioinformatic-intensive analyses, called phylogenetic analysis, revolves around sequence data used to infer evolutionary relationships and the effects of natural selection on different components of the genome. One of the first steps in carrying out a phylogenetic analysis to infer evolutionary relationships from DNA sequences involves aligning the sequences from multiple individuals or species. Sequences need to be aligned because small to large insertions and deletions of DNA bases occur over evolutionary time. While this is relatively straightforward when only two sequences have to be aligned to each other, when multiple sequences are involved, the computational complexity increases dramatically. Multiple sequence alignment, like many methods of phylogenetic analysis, is an NP-hard (Nondeterministic Polynomial-time hard) problem. Such problems demand a great amount of computational power, with either long run times on individual processors working in the background or parallelized versions running on a cluster of processors.
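The two-sequence case can be made concrete with the classic Needleman-Wunsch dynamic programming algorithm. It is named here as a standard textbook method, not as the one used in the Laboratory. The sketch below is a minimal Python version; the function name `global_align` and the scoring values (match +1, mismatch -1, gap -2) are arbitrary assumptions for illustration.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment; returns (aligned_a, aligned_b, score)."""

    def pair_score(x, y):
        return match if x == y else mismatch

    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1]),  # align two bases
                score[i - 1][j] + gap,  # gap in b
                score[i][j - 1] + gap,  # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[-1][-1]


if __name__ == "__main__":
    top, bottom, total = global_align("GATTACA", "GATACA")
    print(top)
    print(bottom)
    print("score:", total)
```

The table filled in here has one dimension per sequence, so its size is the product of the sequence lengths. Extending the same exact approach to many sequences makes the table grow exponentially with the number of sequences, which is one way to see why multiple alignment is so much harder.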
Phylogenetic analyses (which involve inferring both the structure of an evolutionary tree and the dates at which different lineages split from each other) have become increasingly complex and computationally time consuming. Parsimony and likelihood approaches often require evaluating tens of millions to billions of possible trees in order to find the one(s) that provide the best fit to various models of evolution. Bayesian methods employing a Markov Chain Monte Carlo (MCMC) approach, while less computationally time consuming than searching through the entire tree-space, nevertheless require long run times. Parallelization of these techniques, however, now allows some of these analyses to be carried out in days, rather than weeks.
Bioinformatics and computational science are also used in the NYU Molecular Anthropology Laboratory to understand the co-evolution of primates and their pathogens. Several software packages (e.g., PAML, HYPHY) are being used to test for evidence of adaptive evolution and natural selection at the DNA level. One project is examining the genes involved in the immune system of African monkeys, to see if they show evidence of adaptation to prolonged exposure to the simian immunodeficiency virus (SIV). Because positive or adaptive selection suggests that some changes are beneficial to the organism, examination of specific amino acids under positive selection observed in these cases may provide genetic evidence of co-evolution between African monkeys and SIV, and explain why some primates are better adapted to SIV infection. Since HIV and SIV are closely related viruses that infect humans and our non-human primate relatives, respectively, understanding how the latter fend off infection, or at least its effects, may benefit HIV research.
Clearly, the combination of bioinformatic techniques with other analytic approaches, when applied to DNA sequences, has become as critical as the laboratory instrumentation used to collect the data.
Author Biographies
Dr. Disotell is a Professor of Anthropology in the Faculty of Arts and Science. His research is centered upon the theme of primate and human evolution, at all levels from the populational to the supraordinal.
Dr. Di Fiore is an Associate Professor of Anthropology in the Faculty of Arts and Science. His research focuses on the comparative socioecology, mating systems, and population genetic structures of primates, particularly of the neotropics.



