Abstract
We present new methodology for calculating sampling distributions of singlenucleotide polymorphism (SNP) frequencies in populations with timevarying size. Our approach is based on deriving analytical expressions for frequencies of SNPs. Analytical expressions allow for computations that are faster and more accurate than Monte Carlo simulations. In contrast to other articles showing analytical formulas for frequencies of SNPs, we derive expressions that contain coefficients that do not explode when the genealogy size increases. We also provide analytical formulas to describe the way in which the ascertainment procedure modifies SNP distributions. Using our methods, we study the power to test the hypothesis of exponential population expansion vs. the hypothesis of evolution with constant population size. We also analyze some of the available SNP data and we compare our results of demographic parameters estimation to those obtained in previous studies in population genetics. The analyzed data seem consistent with the hypothesis of past population growth of modern humans. The analysis of the data also shows a very strong sensitivity of estimated demographic parameters to changes of the model of the ascertainment procedure.
A lot of research has been done to develop methods for discovery of singlenucleotide polymorphisms (SNP) and to characterize distributions of SNPs across the genome (Collinset al. 1997; Wanget al. 1998; Cargillet al. 1999; Marthet al. 1999; PicoultNewberget al. 1999; Altshuleret al. 2000). SNP data have already been used in association studies of complex diseases (Boerwinkleet al. 1996; Halushkaet al. 1999; Bonnenet al. 2000; Trikkaet al. 2002), and it is believed that eventually they will enable creation of fine genetic maps for complex traits analysis (Kruglyak 1999; Rish 2000). Databases, like that of the SNP Consortium, at http://snp.cshl.org, contain massive amounts of data on positions of SNPs in the human genome, but it is likely that most of these SNPs are very rare and therefore of limited value in gene association studies. Estimates of distributions of expected relative frequencies of SNPs result from studies that use population genetics models, e.g., Durrett and Limic (2001) and Wang et al. (1998), and the predicted excess of rare alleles is explained as resulting from expansion of human populations.
Using population genetics methods to model and analyze SNPs opens an area for investigating problems like predicting frequencies of SNPs under various demographic scenarios, inferring demographic parameters and history from sampling frequencies of SNPs, comparing estimates obtained on the basis of SNP data to those obtained with other methods, and evaluating efficiency of using SNP data for estimation of population parameters. Several interesting studies were carried out in this area. Studies by Durrett and Limic (2001) and Wang et al. (1998) estimated frequencies of SNPs under the hypothesis of population growth. A problem of how sampling frequencies of SNPs are influenced by ascertainment procedures was investigated by Eberle and Kruglyak (2000), Yang et al. (2000), and Renwick et al. (2002). Using SNPs for estimation of the scaled product parameter θ= 4N_{e}μ of the effective population size N_{e} and mutation rate μ, under assumption of constant population size, was studied by Kuhner et al. (2000). They took into account various hypotheses of spatial (chromosomal) distributions of SNPs such as complete or partial linkage or occurrence of linked segments of nonrecombining SNPs and, on the basis of extensive simulations, evaluated accuracy of estimates and possible sources of bias. Studies by Nielsen (2000) and Wakeley et al. (2001) were devoted to detection of signatures of human population growth in SNP data. Nielsen (2000) fitted the scenario of exponential expansion to SNP data of PicoultNewberg et al. (1999). Wakeley et al. (2001) used the model of stepwise change of the population size with population subdivision (Wakeley 2001). They fitted their model to SNP data from Wang et al. (1998), Cargill et al. (1999) and Altshuler et al. (2000). Parameterspace regions corresponding to the highest likelihoods were not inconsistent with the hypothesis of population growth. Moreover, if the ascertainment bias was not considered, less realistic shapes of parameter regions were obtained. Comparison of cases in which population substructure was not considered with those in which it was considered seems to support the latter scenario. To evaluate SNP frequencies, these studies used the standard coalescence approach and Monte Carlo simulations.
Sampling distributions of SNP frequencies in populations with timevarying size can be calculated with the use of analytical expressions for the expected lengths of branches in the coalescence tree derived in the articles by Griffiths and Tavare (1998), Wooding and Rogers (2002), and Polanski et al. (2003). Analytical expressions allow computations, which are faster and more accurate than Monte Carlo simulations. However, the approaches shown in the articles by Griffiths and Tavare (1998), Wooding and Rogers (2002), and Polanski et al. (2003) suffer from one common difficulty, numerical instability for larger genealogies. When the analyzed genealogy size is >50, these analytical methods are either inapplicable or difficult to apply, due to the explosion of coefficients in equations. Wooding and Rogers (2002) give a method to stabilize numerical computations, which is valid for the case where effective population size changes in a stepwise manner. Here we show another approach, which is more general in the sense that it does not require assumption of piecewise constant history of effective population size. We transform equations for the relative frequencies of SNP to the form with nondiverging coefficients and we provide expressions, obtained with the use of methods of hypergeometric summation, to compute these coefficients. We also provide analytical expressions to describe the influence of the discovery procedure (ascertainment) on SNP frequencies. Our methods allow us to perform tasks that otherwise are prohibitive or cumbersome, like analyzing large genealogies, estimating confidence limits for parameters by resampling studies, and studying sensitivity of models to parameter changes. Using our methods we study our power to test the null hypothesis of evolution with constant population size vs. the alternative hypothesis of population expansion, for SNP data, under the exponential model of population size change. We also analyze some of the available SNP data (PicoultNewberget al. 1999; Trikkaet al. 2002) and we compare our results to those obtained in previous studies concerning estimation of populations size changes (Slatkin and Hudson 1991; Rogers and Harpending 1992; Polanskiet al. 1998; Weiss and Haeseler 1998).
METHODS
We consider the process of coalescence with timechanging effective population size. Notation for the coalescence tree, for the sample of size n = 5 DNA sequences, is shown in Figure 1. Time t is measured, in number of generations, from the present to the past. Random times between coalescence events are denoted by S_{n}, S_{n}_{–1},..., S_{2} and s_{n}, s_{n}_{–1},..., s_{2}. Cumulative times to coalescence, from sample of size n to sample of size k – 1, are denoted by T_{k}, k = 2, 3,..., n, and their realizations by corresponding lowercase letters t_{n}, t_{n}_{–1},..., t_{2}, 0 < t_{n} < t_{n}_{–1}... < t_{2}.
We assume that an observed SNP was produced by a single, neutral mutation, like the one denoted in Figure 1 by an open circle. In Figure 1 sequences 4 and 5 have mutant alleles (bases), while sequences 1, 2, and 3 have ancestral ones. In the situation where it is not known which allele is mutant and which is ancestral, we use the terms rare and frequent allele. In other words, the SNP in Figure 1 has configuration b = 2 mutant vs. n – b = 3 ancestral, or b = 2 rare vs. n – b = 3 frequent alleles. We assume that mutation intensity for SNPs is very low; i.e., they follow the infinitesites mutation model.
Probability that a SNP has b mutant bases: Probability q_{nb} that a SNP site in a sample of n chromosomes has b mutant bases, under the infinitesites mutation model, is given by Griffiths and Tavare (1998, Equation 1.3) in terms of expectations of times in the coalescence tree (see also articles by Fu 1995; Sherryet al. 1997; Nielsen 2000; Wooding and Rogers 2002). In our notation, this expression has the form
The above expression can be written as
Equation 2 is an analytic expression for probabilities q_{nb}. Wooding and Rogers (2002) derive equations with the structure analogous to Equations 2, 3, 4, 5, which also contain expectations defined in (3). In contrast to (2), they do not provide explicit expressions for coefficients in the equations; instead they use linear algebra operations (matrix diagonalization) to compute them. Both articles (Wooding and Rogers 2002; Polanskiet al. 2003) report that it is rather difficult to efficiently apply analytical formulas for genealogies of size n > 50 because of the occurrence of diverging terms with alternating signs.
Methods for computation of q_{nb} for large genealogies: To avoid large numerical errors in summations in (2) for genealogies n > 50, one needs to apply computations with precision of hundreds, or even thousands, of decimal digits (Wooding and Rogers 2002), which significantly slows down computational process and requires appropriate software. Such computations must be also carefully executed. It is necessary to repeat computations several times, with an increasing accuracy, and to examine the convergence of the returned values.
Wooding and Rogers (2002) developed a way to avoid the necessity of extending precision of the arithmetics, based on a uniformization technique of computing matrix exponents. It is applicable for the case when the population size changes in a stepwise (piecewise constant) manner, with a finite number of steps, and it allows evaluating the expressions in a standard double precision arithmetic. However, when the number of steps in the population size history becomes large, e.g., if one attempts to approximate a given N_{e}(t) by a piecewise constant function, this approach may be difficult to apply.
Below, we present a method for computing q_{nb} for large genealogies, which is more general than the one developed by Wooding and Rogers (2002), in the sense that it does not require assumption of stepwise change of N_{e}(t). The idea is to reverse the order of summation in both denominator and numerator in Equation 2. We observe that the resulting expressions contain coefficients that do not explode when n increases. Proceeding in this way we obtain
Expressions in (10) and (9) are sums of hypergeometric series, which can be seen by factoring the denominators in (6),
Influence of the ascertainment procedure on SNP sampling frequencies: Most of the published data on SNP sampling frequencies are obtained in a twostep process, where the first step involves discovering chromosomal locations of a number of SNPs, and the second one involves DNA sequencing of a sample of n chromosomes restricted to locations discovered in the first step. The first step is called SNP ascertainment and is based on number of chromosomes smaller than n. As described in previous studies, taking into account the ascertainment scheme is a very important aspect of SNP data analysis. Below we derive expressions for modeling the way in which ascertainment modifies SNP sampling frequencies.
We use the following notation introduced by Wakeley et al. (2001): Data set size is n = n_{D} + n_{O}, and ascertainment set size equals to n_{O} + n_{A}, where n_{A} stands for the number of ascertainmentonly samples; n_{O}, the number of overlapping samples (both in the ascertainment study and in the later data set); and n_{D}, the number of dataonly samples.
To determine how ascertainment modifies probability distribution (22), we merge ascertainment and data sets to obtain the joint set of size n_{J} = n_{D} + n_{O} + n_{A}. We treat the ascertainment procedure as sampling SNP alleles, without replacement, from the joint set. A SNP is discovered if (a) both alleles are present in the ascertainment sample and (b) none of the alleles in the ascertainment sample has number of copies less than G, where G is a predetermined threshold. Since the joint set contains elements of two types (two alleles), the number of copies of alleles in the ascertainment sample follows a hypergeometric distribution. We analyze two cases: (i) no overlap, which means n_{O} = 0, n = n_{D}, n_{J} = n_{D} + n_{A}; and (ii) overlap only, which means n_{A} = 0, n = n_{J} = n_{D} + n_{O}. The case where both overlap and ascertainmentonly samples are present is obtained by combining i and ii. We compute frequencies of discovered SNPs in the data set, which follow from conditions a and b above. We analyze first the case i. If a SNP in the joint set has b mutant and n_{J} – b ancestral bases, then the probability that a sample of size n_{A} from the joint set has β mutant and n_{A} –β ancestral bases is
If it is not known which of the alleles is mutant and which one is ancestral, we need to symmetrize
In the sequel, we refer to the models described above as type i and type ii ascertainment, respectively.
Likelihood function of the sample: Data studied are derived from a number of SNP sites. Let us denote the number of SNP loci by M and random variables defined by diallelic data by
When SNP sites are located far from one another, random variables
{X_{1}, X_{2},..., X_{M}} in (21) are independent. If the observed numbers of copies of rare alleles are
RESULTS
Exponential history of population size: In our computations we assume an exponential history of effective population size. In previous publications devoted to SNP and demography, Nielsen (2000) assumed an exponential history of N_{e}(t). However, his analysis is very brief and restricted only to simulations. Others (Wakeleyet al. 2001; Wooding and Rogers 2002) analyzed stepwise histories of effective population size changes.
For an exponential scenario of population growth
It turns out that sampling frequencies of SNPs depend only on the product parameter κ= rN_{e0} of initial effective population size and exponential factor.
Distributions of SNP frequencies: Figure 3 provides examples of probabilities of different configurations of SNP sites, for sample size n = 30, and different values of the parameter κ (0, 1, 10), under the assumption that data collection did not include an ascertainment step [expression (22)] or under the ascertainment model of type ii [expression (20)] with n_{O} = 10, G = 1, or G = 2. As already reported in many articles, increasing κ results in higher proportions of rare alleles in the sample. Plots in Figure 3 also show how ascertainment modifies the distribution of SNP frequencies. Increasing the threshold value G flattens the distribution of frequencies. Both types of ascertainment (i and ii) have similar effects on SNP frequency distributions (results not shown).
Likelihoodratio tests to detect signatures of population growth: An interesting issue is our power to test hypothesis H_{0} of evolution with constant population size, κ= κ_{0} = 0, against the alternative hypothesis H_{1} of population expansion, κ=κ_{1} > 0, on the basis of SNP data. It is also of interest to determine how this power is affected by the ascertainment step of data collection. From previous computations it follows that SNP data can be seen as samples from multinomial distributions given by expressions (22), (19), or (20). Assuming that the number of SNP sites is always large enough to allow asymptotic approximation (Bickel and Doksum 2001, p. 227) we computed powers of singlevalue vs. singlevalue likelihoodratio tests of statistical null hypothesis H_{0} (constant population size κ=κ_{0} = 0) vs. the alternative H_{1} (population expansion with κ=κ_{1} > 0). We assumed significance level α= 0.05 and values of κ_{1}, κ_{1} = 0.1, κ_{1} = 1, κ_{1} = 10, κ_{1} = 100. Table 1, A and B, gives powers of likelihoodratio tests for sample size n = 50, for different models of ascertainment: no ascertainment [probabilities given by expression (22)] or ascertainment model type ii [expression (19)] with parameters n_{O} and G. Table 1A is for the number of SNP loci M = 30, and Table 1B is for M = 100. From values of powers of tests depicted in Table 1, A and B, one can see that the cases κ_{0} = 0, κ_{1} = 0.1 are practically indistinguishable; κ_{0} = 0, κ_{1} = 1 may be distinguished only for a large enough number of SNP sites, while κ_{0} = 0, κ_{1} = 10, or κ_{1} = 100 are rather easily distinguishable even for small numbers of SNPs. The ascertainment step in data collection can deteriorate the power to detect signatures of population growth. Increasing the threshold value of G, the aim of which typically is filtering out sequencing errors in the data, also progressively lowers the probability of rejecting the hypothesis H_{0} of constant population size; i.e., it increases the probability of committing type II error. This results from the flattening effect of increasing G observed in Figure 3.
Data analysis: Data on segregating sites in mitochondrial DNA from Cann et al. (1987): First, we apply our method to the data on segregating sites in mitochondrial DNA from the article by Cann et al. (1987). We fit the exponential scenario (25) to these data by treating each segregating site as an independent SNP. Technically, we estimate the product parameter κ= rN_{e0} in (25).
Data in Cann et al. (1987) include 195 segregating sites in 148 individuals. Table 2 shows the statistics of segregating sites in these data. Elements in the first column (b) are possible numbers of copies of the rare allele, and elements in the second column (c_{b}) are numbers of segregating sites in the sample that have the number of copies of the rare allele equal to b. Figure 4 shows the plot of loglikelihood function obtained by using expressions (8, 9, 10, 11, 12, 13, 14, 15) and (26). Maximum of the loglikelihood function is attained at
Segregating sites collected by Cann et al. (1987) are obtained from nonrecombining DNA and the independence assumption is clearly not satisfied. To explore whether a violation of the assumption that SNPs are independent significantly affects the estimate of the parameter κ, we have performed 100 coalescent simulations of genealogies representing ancestries for 148 mtDNA sequences. We added mutations along branches of coalescence trees according to the infinitesites model with intensity μ. In the simulations we assumed mutational time scale τ= 2μt and exponential change of θ(τ) = 2N_{e}(τ)μ, θ(τ) =θ_{0} exp(–ρτ), with parameters θ_{0} = 400, ρ= 0.2. So, the true value of the product parameter κ was κ= 80. For each of these 100 simulation experiments we treated segregating sites as independent SNPs and we estimated the parameter κ by maximizing likelihood (24). We obtained the mean of estimates equal to 86.8 and standard deviation equal to 29.7. This confirms that our approach, at least for these specific values, will allow us to obtain a reasonable estimate of κ. From these simulations follows the estimate of 95% confidence interval for the parameter κ, when the sample consists of 148 mtDNA sequences and the demography is as shown above. This estimate, [mean – 2 standard deviations, mean + 2 standard deviations], equals κ ∈ [27, 147]. This estimate is quite consistent with the 95% confidence interval obtained from likelihoodratio statistics, κ ∈ [40, 166]. The shift toward the left of the confidence interval based on simulations results from the asymmetric shape of the distribution of the estimate of κ. By applying a logarithmic transformation to simulation results (estimates of κ) we were able to obtain almost perfect agreement of the two confidence intervals, [40, 166] and [45, 175].
SNP data from PicoultNewberg et al. (1999) and Trikka et al. (2002): There are several population studies in the literature where relative frequencies of SNP alleles are shown. We have chosen data from the research by PicoultNewberg et al. (1999) and data on SNPs in three human genes: BLM, WRN, and RECQL, reported recently by Trikka et al. (2002). In our analysis, we used the data on Caucasians from both sources. The first reason to focus on Caucasians was the possibility of comparing two results, and the second reason was that discovery samples were from Caucasians. PicoultNewberg et al. (1999, Table 4) have 44 SNP sites in 8 Caucasians (16 chromosomes), while Trikka et al. (2002, Table 2) show allele frequencies of a total number of 31 SNPs in samples of chromosomes of sizes varying from 154 to 158.
When analyzing SNP data we followed remarks given in the source articles (PicoultNewberget al. 1999; Trikkaet al. 2002) to adjust parameters n_{A}, n_{O}, and G of the model of ascertainment procedure. We modeled the ascertainment procedure for collecting data from PicoultNewberg et al. (1999) by using expression (17) with n_{A} = 10 and G = 2. A plot of loglikelihood function for the data on Caucasians from the articles by PicoultNewberg et al. (1999) is shown in Figure 5. It attains a maximum at
Sensitivity of estimates to ascertainment model parameters: A question arises: How sensitive are the estimates of parameter κ to changes of the model of the ascertainment? We studied this question by increasing or decreasing the value of the threshold G in expressions (17) and (18). Indexing the estimated parameter with n_{A}, n_{O}, and G, we can denote our estimates from the previous section as
The results of computations show an extreme sensitivity of estimates to the ascertainment model. Notably,
The fact that the ascertainment model strongly affects estimates of parameters is also confirmed in the previous articles on SNPs. Wakeley et al. (2001) in their Figure 3 show a large bias in SNP frequencies resulting from ascertainment. Similarly, Nielsen (2000)uses a rather oversimplified model, n_{O} = 2, G = 1, for data (PicoultNewberget al. 1999) and obtains
DISCUSSION
The methods developed in this article allow us to analyze large data sets and carry out computations for different parameter values, which helps us draw more conclusions from data. We have shown examples of applying our methodology to the study of several issues arising in SNP data analysis.
We are particularly interested in the problem of what are reasonable values of the exponential growth product parameter κ= rN_{e0} obtained on the basis of DNA data. Insight into this problem can be gained by comparing estimates obtained using different approaches.
Our aim when estimating κ from relative frequencies of segregating sites in the article by Cann et al. (1987) was to confirm that different methodologies used for the same data will still lead to comparable results. Therefore, we compared our estimate, obtained by treating nonrecombining segregating sites as SNPs, to those previously obtained on the basis of the same or similar data, but with the use of different methods. Studies that we compared to ours were those by Slatkin and Hudson (1991), Rogers and Harpending (1992), and Polanski et al. (1998), who used pairwise difference statistics, and by Weiss and von Haeseler (1998), who applied the maximumlikelihood approach. Data in these articles originate from different sources, but considering estimations of the authors, reasonable ranges of the product parameter κ, for both the worldwide population and Caucasians, fit into the interval from κ= 50 to κ= 500. Our estimate of
Mutation intensity (per site) at autosomal loci is approximately one order of magnitude lower than that in mtDNA (Li 1997). However, the estimate of the product parameter κ= rN_{e0} is invariant with respect to timescale changes and therefore does not depend on the value of the mutation intensity. We can assume that mutation intensity is used only to scale the time axis. The effective population size for autosomal loci is four times the effective population size for loci at mtDNA. So, the estimate of κ from mtDNA should be onefourth the estimate of κ from nuclear DNA. Taking into account the large stochastic variation, the estimates of κ coming from SNP data should then be comparable (of the same order of magnitude) to those obtained from mtDNA.
However, our estimates of the parameter κ based on SNP data,
Acknowledgments
The authors are grateful to Peter Paule and Markus Schorn for making their program, implementing Zeilberger's algorithm in Mathematica, available to the scientific community. The authors were supported by National Institutes of Health grants GM58545 and CA75432, Polish Scientific Committee (KBN) research projects PBZ/KBN/040/P04/2001 and 4T11F 01824, and NATO collaborative linkage grant LST.CLG.977845.
Footnotes

Communicating editor: N. Takahata
 Received January 29, 2003.
 Accepted May 30, 2003.
 Copyright © 2003 by the Genetics Society of America