It might seem a little odd for me to vet an e-mail I am planning to send by publishing a draft of it on a public website that receives thousands of hits every day. But hey, odd is fun! And so I invite you to read this draft (as yet unsent) of a letter to the authors of the recent L.A. Times article on DNA, cold hits, and statistics.
I’d like readers to review it before I send it because I am not a statistics expert, and although I consulted with more than one during the process of drafting this, I want to make sure I have made no mathematical or logical misstatements.
Mr. Felch and Ms. Dolan,
After discussions with numerous people with statistical expertise, I am reasonably confident (that is, as confident as a layman like myself can be) that your recent front-page article on DNA cold case statistics gravely misstated the meaning of the math you discuss.
Your article said:
Jurors were not told, however, the statistic that leading scientists consider the most significant: the probability that the database search had hit upon an innocent person.
In Puckett’s case, it was 1 in 3.
I don’t believe the math in question supports the statement that there was a “1 in 3” chance that “the database search had hit upon an innocent person” in selecting Puckett.
The starting point for my analysis was this post by Eugene Volokh, a UCLA law professor and blogger. Prof. Volokh agrees with me that your formulation is wrong. He supports his position with clear reasoning and examples; I commend his post to you. My e-mail to you (which I am blogging on my site) merely expands on Prof. Volokh’s argument as it relates to the article.
(To keep the discussion simple, I will assume there are no issues relating to data corruption or human error. I’ll also stick with the numbers used in your article: a random match probability of 1 in 1.1 million, and a database of 338,000.)
The logic behind the database adjustment was expressed in a report from the National Research Council as follows:
Recommendation 5.1 proposes multiplying the random-match probability (P) by the number of people in the database (N). If the person who left the evidence DNA was not in the database of felons, then the probability that at least one of the profiles in the database would also match the incriminating profile cannot exceed NP.
The clear working assumption here is that the database consists of “innocent” people, none of whom left the evidence DNA.
This makes sense, at least in a hypothetical case where the jury is informed that the authorities came to suspect the defendant because of a database hit. There is a certain “what are the chances?!” quality of DNA evidence that presumes the defendant was under suspicion before the DNA comparison was done. In other words, if the defendant is before the jury because of a database hit, and the jury knows it, the jury may be “wowed” by the fact of the hit. But the impact of this “wow” factor is likely to be considerably lessened if the jury is told that, in a hypothetical search of a database of completely innocent people, there is a 1/3 chance of a hit.
Thus, it seems clear to me that the idea of the adjustment is to communicate to the jury the likelihood of a false positive, based on the assumption that the true donor of the incriminating profile is not in the database.
My understanding is bolstered by an e-mail I received from Prof. David Kaye, who served on the 1996 NRC committee that recommended the adjustment. In that e-mail, Prof. Kaye stated:
[T]he statisticians who favor an adjustment to the random-match probability are considering [the question:] What is the chance that a search of a database will turn up exactly one match when the source of the crime-scene DNA is someone who is unrelated to everyone in the database?
He restated the question in this way:
What is the chance that a database composed entirely of innocent people (with respect to [the] crime being investigated) will show a match?
Note that the fundamental assumption of the hypothetical is that everyone in the database is innocent. Then, and only then, can one use the adjusted figure recommended by the committees as a (very rough) approximation of the chances of a false positive.
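The arithmetic behind the adjusted figure can be sketched in a few lines of Python. This is only an illustration, using the two numbers from the article and assuming (for simplicity, and this is an assumption not stated in the article) that each profile in the database matches independently of the others. It shows both the N times P bound from the NRC recommendation and the slightly smaller exact figure that the bound approximates.

```python
# Figures quoted in the article.
N = 338_000            # number of profiles in the database
P = 1 / 1_100_000      # random match probability

# NRC-style upper bound: if the true donor is NOT in the database, the chance
# of at least one match cannot exceed N * P.
np_bound = N * P

# Chance of at least one coincidental match, ASSUMING each of the N profiles
# matches independently with probability P and none belongs to the true donor.
p_at_least_one = 1 - (1 - P) ** N

print(f"N * P bound:                  {np_bound:.3f}")
print(f"At least one match (indep.):  {p_at_least_one:.3f}")
```

Both numbers describe the same hypothetical: a search of a database in which nobody is the source of the crime-scene DNA. Neither says anything about a database that may contain the true donor.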
If, by contrast, you start with the assumption that you don’t know whether the true donor is in the database or not, then the 1/3 number tells you nothing about whether a single hit from the database is a hit to a) the true donor of the incriminating DNA or b) an innocent person who happens to share the same profile (i.e., a “false positive”).
It’s important to keep in mind that what we’re talking about here is the situation where a database search is conducted, and has resulted in only one hit. The question is: what can we say, statistically, about that one hit?
In the case where you don’t know whether the database contains the true donor, or “guilty” person (speaking very loosely), the meaning of a single hit from that database is a function of the likelihood that the true donor is in the database and (given that only one hit was received) the likelihood that nobody else with that profile is in the database.
If you don’t know whether the true donor (or “guilty” person) is in the database or not, the 1/3 number is merely an expression of the likelihood of a hit — any hit. It’s not an expression of the chances that any resultant hit is a hit to an “innocent” person.
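To make the dependence concrete, here is a minimal sketch of how the chance that a single hit points to an innocent person would vary with the likelihood that the true donor is in the database. Everything about the model is an assumption made for illustration: the function name and the parameter q (the prior probability that the true donor is in the database) are hypothetical and appear nowhere in the article or the NRC report, profiles are treated as matching independently, and the true donor, if present, is assumed to match with certainty. In practice nobody may know what value of q to use.

```python
def prob_hit_is_innocent(q, n=338_000, p=1 / 1_100_000):
    """Chance that a lone database hit is to someone other than the true donor.

    q is a HYPOTHETICAL prior probability that the true donor is in the
    database. n and p are the database size and random match probability
    quoted in the article. Independence between profiles is assumed.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    # Donor in database: exactly one hit means nobody else matched.
    # Donor not in database: exactly one hit means one coincidental match.
    # Both cases share the factor (1 - p) ** (n - 1), which cancels in the ratio.
    weight_true_donor = q
    weight_innocent = (1 - q) * n * p
    return weight_innocent / (weight_true_donor + weight_innocent)


for q in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"q = {q:.1f}: chance the hit is innocent = {prob_hit_is_innocent(q):.3f}")
```

Under these assumptions the answer ranges all the way from certainty (the donor cannot be in the database) down to zero (the donor is sure to be in it), which is why a single fixed figure such as 1/3 cannot stand in for it.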
Again, I am not a statistics expert, and (perhaps as a result) I don’t know whether it is possible to tell juries anything statistically meaningful about the likelihood that the person in front of them is innocent. (Neither does Prof. Volokh, for what it’s worth.) But I feel fairly confident that the 1/3 number is not an expression of the probability that the person sitting in front of jurors is “innocent.”
Thus, I believe that your article is wrong to say, in the statement quoted above, that “the probability that the database search had hit upon an innocent person” in Puckett’s case was “1 in 3.”
That is simply not so, I believe.
If I’m right, I think The Times needs to correct this misimpression. What’s more, I think any correction should be very prominent, given the extreme prominence of the error (or what I believe to be an error) on the front page of the paper’s Sunday edition.
I hope you will see your way clear to discussing these issues with knowledgeable experts. I also hope that you will issue an appropriate and prominent correction if, after reflection and consultation with experts, you believe I have correctly analyzed the issue.
I look forward to your response.
P.S. I should note that my argument does not address the fact that guilt is not automatic once it is determined that the suspect is the donor of the DNA at the crime scene, just as innocence is not automatic once it is determined that he is not the donor. I assume you are aware of the difference between source attribution and guilt, and left out an explanation of the difference for space reasons.
Nor does my argument address the fact that the 1/3 number is an approximation of an approximation. (Prof. Volokh’s post has more details on the relevant statistics.) I also presume you were aware of this, and believe that the 1/3 number is simply a conservative simplification of the more complex equation that Prof. Volokh sets forth in his post.
My argument has nothing to do with these relatively minor quibbles. One could argue that ignoring them is necessary to keep the issue straightforward and simple. My problem is that, these minor issues aside, the way you have expressed the meaning of the adjusted number is (I believe) so misleading as to be fairly termed an error.
Please let me know what you think. I remain humble on the issue because of my lack of expertise in the field.