This is a follow-up to this morning’s post on DNA, cold hits, and statistics.
Prof. David Kaye, whom I cited in this morning’s post, has responded to my e-mail and given me permission to quote him. His reply follows:
Thanks for your inquiry. This is a surprisingly subtle statistical question. I have devoted two chapters to it in a forthcoming book to be published by Harvard University Press and have circulated a manuscript on the California cases and the general issue to law reviews. I served on the 1996 NRC committee that recommended adjustment, but I now find it difficult to defend that recommendation. Basically, there are two distinct questions:
Question 1. What is the chance that a database composed entirely of innocent people (with respect to the crime being investigated) will show a match? For databases that are small relative to the number of people who could have committed the crime, the NRC adjustment makes sense. The British experience mentioned in the article shows that this chance is much larger than the random match probability. But why is this “innocent database” probability important when considering what the evidence of a match to a named individual proves?
Question 2. How much does the fact that the defendant identified by a trawl through the database matches, and that no one else in the database does, change the odds that he is the source of the DNA at the crime scene? This is the question that is of interest to a jury trying to weigh the evidence. It is the one that Peter Donnelly and other statisticians have addressed. The answer is that the single match in the database raises the odds even more (but only slightly more) than does testing a single person at random and finding that he matches. As you point out, in the limit of a database that includes every person on earth, the evidence of a single match in the database becomes conclusive. How can the value of the evidence possibly decline as small databases get slightly bigger, then somehow switch direction and get immensely stronger as they get bigger still?
The discussion of the issue in the news and the courts is oversimplified and misleading (but entertaining). The manuscript of the law review article is attached. Feel free to quote from it as “submitted for publication.”
With best wishes,
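To put rough numbers on Prof. Kaye's Question 1, here is a minimal Python sketch. The random match probability and database size are made-up illustrative values, not figures from any case, and the function names are my own. It assumes every profile in the database belongs to an innocent person unrelated to the true source, and that each profile matches independently with the same probability. It also assumes the NRC adjustment is the simple product of database size and random match probability.

```python
def prob_at_least_one_match(random_match_prob, database_size):
    """Chance that a database of innocent, unrelated people shows one or
    more matches, assuming each profile matches independently."""
    return 1.0 - (1.0 - random_match_prob) ** database_size


def nrc_adjusted_probability(random_match_prob, database_size):
    """Assumed form of the NRC adjustment: database size times the random
    match probability, capped at 1 because it is reported as a probability."""
    return min(1.0, database_size * random_match_prob)


if __name__ == "__main__":
    p = 1e-6       # illustrative random match probability (assumption)
    n = 100_000    # illustrative database size (assumption)
    print("random match probability:", p)
    print("innocent database shows a match:", prob_at_least_one_match(p, n))
    print("NRC-style adjusted figure:", nrc_adjusted_probability(p, n))
```

With these inputs the chance of a match somewhere in the innocent database is several orders of magnitude larger than the random match probability, and the adjusted figure tracks it closely. That is the sense in which the adjustment answers Question 1. It says nothing yet about Question 2.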
I quoted from Prof. Kaye’s article in comments to the previous post. Let me quote one of those passages here, because I think it sheds light on the issue:
We can approach this question in two steps. First, we consider what the import of the DNA evidence would be if it consisted only of the one match between the defendant’s DNA and the crime-scene sample (because he was the only person tested). Then, we compare the impact of the match when the data from the trawl are added to give the full picture. . . . In the database trawl case . . . [i]f anything, the omitted evidence makes it more probable that the defendant is the source. On reflection, this result is entirely natural. When there is a trawl, the DNA evidence is more complete. It includes not only the fact that the defendant matches, but also the fact that other people were tested and did not match. The more people who are excluded, the more probable it is that any one of the remaining individuals — including the defendant — is the source. Compared to testing only the defendant, trawling therefore increases the probability that the defendant is the source. A database search is more probative than a single-suspect search.
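The two-step comparison in that passage can be sketched numerically. This is a simplified model of my own and not a calculation taken from Prof. Kaye's article: it assumes a fixed population of possible sources, each equally likely to be the source before any testing, a true source who always matches, and non-sources who match independently with the random match probability. The population, database size, and function names are hypothetical.

```python
def posterior_single_suspect(random_match_prob, population):
    """Probability the defendant is the source when he is the only person
    tested and he matches. Prior odds are 1 : (population - 1)."""
    return 1.0 / (1.0 + (population - 1) * random_match_prob)


def posterior_database_trawl(random_match_prob, population, database_size):
    """Probability the defendant is the source when a trawl of the database
    finds that he matches and the other (database_size - 1) people do not.
    The excluded people drop out, leaving only those outside the database
    as alternative sources."""
    untested = population - database_size
    return 1.0 / (1.0 + untested * random_match_prob)


if __name__ == "__main__":
    p = 1e-6               # illustrative random match probability (assumption)
    population = 1_000_000  # illustrative pool of possible sources (assumption)
    print("single suspect:", posterior_single_suspect(p, population))
    for n in (1, 100_000, 500_000, 1_000_000):
        print("trawl, database of", n, ":",
              posterior_database_trawl(p, population, n))
```

Under these assumptions the trawl never lowers the probability that the defendant is the source. A database of one person reproduces the single-suspect result, a modest database raises the probability slightly, and a database covering the whole population makes the single match conclusive.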
I should note that Prof. Kaye’s exposition of the two relevant questions is similar to, but somewhat different from, the questions that I posed in my original post. In an attempt to illustrate what I believed to be the questions addressed by the two competing camps, I posited two similar questions: 1. What are the chances that a search of this database will turn up a match with the DNA profile? and 2. What are the chances that any one person whose DNA matches a DNA profile is indeed the person who left the DNA from which the profile is taken?
Prof. Kaye’s questions state the issue in a more refined and, I believe, more accurate manner. As to my questions, he says in a follow-up e-mail:
The answer to your question #1 depends on the chance that the database contains the source (and, if “a match” means exactly one match, no one else with the matching type). That is not the question that the statisticians who favor an adjustment to the random-match probability are considering. The proposed statistical adjustment relates to the following modified version of your #1:
1′. What is the chance that a search of a database will turn up exactly one match when the source of the crime-scene DNA is someone who is unrelated to everyone in the database?
Likewise, the statisticians who argue that the database search is better evidence than the single-suspect search (and they are the majority of those writing on the topic) focus on a variation of your second question:
2′. What is the chance that the named individual whose DNA matches is the source?
I confess that I did not read the coin example you provided too closely. I suspect that it is correct. I have an example along these lines in my article (inspired by an example in the Donnelly-Friedman article).
Thus, I think the thrust of your remarks is on target, but some of the details of your analysis could be refined.
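Question 1′ asks for exactly one match, which is a slightly different quantity from "one or more matches." A minimal sketch, assuming no one in the database is the source or a relative of the source, and that each profile matches independently with the same probability. The numbers and the function name are illustrative assumptions of mine.

```python
def prob_exactly_one_match(random_match_prob, database_size):
    """Chance that exactly one of database_size unrelated, innocent profiles
    matches: any one profile matches and all the others do not."""
    p, n = random_match_prob, database_size
    return n * p * (1.0 - p) ** (n - 1)


if __name__ == "__main__":
    p = 1e-6     # illustrative random match probability (assumption)
    n = 100_000  # illustrative database size (assumption)
    print("exactly one match:", prob_exactly_one_match(p, n))
    print("database size times p:", n * p)
```

When the database size times the random match probability is small, the exactly-one-match figure is close to that product, which is why the product works as an approximate answer to 1′. It is still an answer about an innocent database, not about the named individual in 2′.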
I thank Prof. Kaye for his correspondence. And yes, the coin example was rather long.
Incidentally, I have an e-mail in to Prof. Peter Donnelly, the Oxford statistician whom I cited in my earlier post. He is out until May 13.
I also received a nice e-mail from Jason Felch, one of the authors of the L.A. Times article, in response to an e-mail I sent him. I have asked him for permission to quote from the e-mail and am awaiting his reply.