Patterico's Pontifications

5/10/2008

Yet Another Error in That Article About DNA, Cold Hits, and Statistics

Filed under: Crime,Dog Trainer,General — Patterico @ 1:52 pm

I have noticed a third error in the L.A. Times article on DNA, statistics, and cold hits. The article said:

Typically, prosecutors rely on FBI statistics to estimate the rarity of a particular DNA profile in the general population. This calculation is known as the Random Match Probability.

The chance that two unrelated people will share the same 13 markers can be as remote as 1 in a quadrillion — a number with 15 zeros. Because the match in Puckett’s case involved only 5 1/2 genetic locations, the chance it was coincidental was higher but still remote: 1 in 1.1 million.

Even placing aside the issue of database searches, it is incorrect to say that “the chance it [the match in Puckett's case] was coincidental was . . . 1 in 1.1 million.” Even if no database had ever been used, this would be an incorrect statement. It is an example of a statistical fallacy known as the “Prosecutor’s Fallacy.” (For detailed discussions of the Prosecutor’s Fallacy and ways that numbers can be misrepresented by English formulations, read here, here, and here.)

It’s important to emphasize that my complaint has nothing to do with the fact that a database search was used. It is true that the article goes on to explain that Puckett’s attorney thought this number misrepresented the probabilities because the match was made through a database. But that is irrelevant to the particular complaint I am making here about the Prosecutor’s Fallacy. Even if the match had not been made through a database, it would be incorrect to say that 1 in 1.1 million represents the chance that the match to Puckett was a coincidence. (I predict that this paragraph of my post will be the one most widely ignored by the comments this post is likely to generate.)

This is a hard concept to explain, and it has much less to do with the actual math than the way that the math is expressed in English.

A simple example makes the point. Let’s say the odds of winning the lottery are 1 in 100 million. That makes it close to certain that you won’t win if you buy only one ticket.

Now assume you bought one ticket and you won. Your jealous friend says that the chance of your numbers matching being a coincidence was 1 in 100 million. By phrasing the probabilities this way, your friend is saying that your win was almost certainly not a coincidence, which is to say that it was close to certain that you would win. When you say “the chance that this event resulted from a coincidence is 1 in 100 million” you’re saying the event was almost certain to happen.

Your friend’s statement is very similar to the way the L.A. Times article phrased the probabilities, and it is an example of the Prosecutor’s Fallacy. By using the formulation “the chance of this match being a coincidence is 1 in 100 million” the speaker is taking extremely low odds and making them sound extremely high. What your friend should have said (and what he meant to say) was this: “the chance you’d win was 1 in 100 million.”

Now: you want a real coincidence? This very distinction recently cropped up in the news. The L.A. Times covered it, and temporarily showed some hint of understanding it. In an article about a Ninth Circuit decision regarding prosecutorial error in characterizing the meaning of DNA results, the L.A. Times discussed the distinction in this way:

The error stemmed from the prosecution expert wrongly conflating two very different mathematical probabilities: The probability that the crime scene evidence matched a person selected at random from the population and the probability that the defendant was guilty.

This formulation is confusing because of the use of the word “guilty” as an imprecise shorthand for the phrase “the person who donated the crime scene DNA.” Still, this shows that the reporters — one of whom, Jason Felch, co-wrote last Sunday’s article on the Puckett case — understand the distinction, or at least show the capacity to understand it.

The original article from last Sunday, about the Puckett case, mis-expresses the concept at another point, but blames it on the prosecution:

Puckett insisted he was innocent, saying that although DNA at the crime scene happened to match his, it belonged to someone else.

At Puckett’s trial earlier this year, the prosecutor told the jury that the chance of such a coincidence was 1 in 1.1 million.

Did he really? Or did the reporters mischaracterize what the prosecutor told the jury? I don’t know. If the prosecutor actually expressed the odds that way, he raised an appeal point for the defense based on the Prosecutor’s Fallacy. But I’m not convinced the error wasn’t the reporters’, given that they gave the same misleading description elsewhere in the article.

(I should note that Eugene Volokh caught this iteration of the error in his previous post on this topic. But because the error is attributed to the prosecutor in this quote, I didn’t notice that the reporters themselves had made the exact same error elsewhere in the article.)

I do know this: the paper has continued to describe random match probability in this misleading way. In an article published yesterday — after the publication of the article describing the case about the Prosecutor’s Fallacy — the deck headline reads:

A long-time scientific controversy centers on how to calculate the probability that such a match would be the result of coincidence.

This is wrong. Granted, it’s hard to express the concept in a headline. But this formulation presumes that you have a match, and you’re talking about the probability that it is the result of a coincidence. Using our lottery example, it would be like saying: “the controversy centers on how to calculate the probability that a particular person having won would be the result of coincidence.” That is not what random match probability addresses. It addresses instead the frequency with which a particular profile appears in a population of unrelated individuals. The database adjustment in question addresses the probability that, if a database is composed of individuals unrelated to the person who donated the crime scene DNA, a database search will nevertheless result in a match to the crime scene DNA.

Again, saying “you had a 1 in 100 million chance of winning” is not the same as saying “the chance your victory resulted from a coincidence is 1 in 100 million.”

This is not trivial. It is important, because people need to understand that the random match probability is not the chance that the defendant is innocent, or (put another way) that it is a coincidence that he is sitting in the defendant’s chair.
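A rough sketch of why the two numbers differ. It assumes, purely for illustration, that exactly one person in some pool of possible donors left the DNA, that everyone in the pool is equally likely beforehand to be that person, and that unrelated people match independently. The function name and the pool sizes are hypothetical; only the 1 in 1.1 million figure comes from the article.

```python
from fractions import Fraction

def chance_match_is_coincidence(rmp, pool_size):
    """Chance that a person who matches is NOT the true source of the DNA.

    Illustrative assumptions: one true source in the pool (who always
    matches), a uniform prior over the pool, and every other person
    matching independently with probability rmp.
    """
    expected_innocent_matches = (pool_size - 1) * rmp
    # Bayes' rule with a uniform prior: the odds of "source" versus
    # "coincidental match" are 1 : expected_innocent_matches.
    return expected_innocent_matches / (1 + expected_innocent_matches)

rmp = Fraction(1, 1_100_000)  # random match probability from the article

# Hypothetical pool sizes; the answer changes with each one.
for pool_size in (11, 1_100_001, 10_000_000):
    chance = chance_match_is_coincidence(rmp, pool_size)
    print(pool_size, float(chance))
```

Under these assumptions the chance of a coincidence equals the random match probability only by accident. With a pool of 1,100,001 equally likely donors it is 1 in 2, even though the random match probability is still 1 in 1.1 million.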

It is important to be accurate about these concepts, and the L.A. Times — probably in an admirable effort to simplify them — keeps mucking them up. Which is, coincidentally (?!), the same thing they accuse the courts of doing.

Now, I feel for the reporters. In writing my posts critical of the article, I have myself at times used formulations that are either unclear or do not precisely represent the statistics involved. I have had to write more than one update clarifying my position or removing language I feared might be inaccurate. (God help me, some commenter might even make me do so with this post!) Expressing these concepts in clear, precise, and accurate English is, as I said yesterday, like walking a tightrope.

To me, this illustrates the fact that people can be easily misled by these concepts — which illustrates the need to be extra careful when you’re a major newspaper with a Sunday circulation of over 1 million, writing about the concept on the front page of your Sunday edition.

And, it illustrates the point that when you make these mistakes, it’s important to admit them. I’ll be writing the paper about this error as well (I already wrote them about the first two errors two days ago, in an e-mail I reproduced here). I hope they get around to correcting the errors prominently, given the prominence given to the original article.

17 Responses to “Yet Another Error in That Article About DNA, Cold Hits, and Statistics”

  1. In the ninth circuit opinion the prosecution’s expert witness was actually even worse. According to the opinion, she first described the random match probability correctly, but when pressed by the prosecution to rephrase it, she then stated it incorrectly.

    But the second error the expert made was worse, and this is most likely what led to it being reversible. From the opinion:

    “Second, Romero inaccurately minimized the likelihood that Troy’s DNA would match one of his four brothers’ DNA, thus underestimating the likelihood that one of Troy’s brothers could have been the perpetrator. She testified that there was a 25 percent chance of two brothers sharing both alleles at one locus, and, using that figure, a 1/6500 chance that one of Troy’s brothers would match Troy’s DNA at all five loci. The Mueller Report indicated that Romero’s calculation was incorrect, as the correct figure is 1/1024. More importantly, Romero’s testimony is misleading because it presented the narrowest interpretation of the DNA evidence. Had Romero accounted for Troy’s four brothers, two of whom lived in Carlin and two of whom lived in neighboring Utah, the chance that Troy’s DNA would match at least one of his four brothers’ DNA can increase to 1/66—almost one hundred times the probability asserted by Romero. This omission was especially egregious given that the victim, Jane, had twice identified Troy’s brother, Trent, as the assailant.”

    My thoughts after reading the opinion were that the prosecution should have known better, that this particular expert witness shouldn’t ever be used, ever again, and the failure to obtain a DNA sample from the other brother who had been identified is mystifying to me.

    Skip (163356)

  2. 1.

    “… and the failure to obtain a DNA sample from the other brother who had been identified is mystifying to me.”

    Can you force someone to give a DNA sample if they aren’t a suspect? I think you need probable cause for a warrant. Perhaps in this case the girl’s id would suffice but I imagine there are cases where you have no probable cause to obtain a sample from a relative just to make your case stronger.

    James B. Shearer (fc887e)

  3. I’m inclined to think that “likelihood” is an inappropriate method for comparing living brothers’ DNA; “measurement” would be the correct method. We know which five and a half genetic markers are to be used; either they match, or they don’t.

    htom (412a17)

  4. The chance of the brothers matching at all 5 loci could be even higher. The 1 in 4 chance of a match at any given locus depends on both parents being heterozygous (having two different genes) at each locus. If either parent is homozygous (2 copies of the same gene) or each parent has a copy of the same allele then the risk of a chance match goes up.

    Lloyd Flack (ddd1ac)
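The per-locus figures discussed in comments 1 and 4 can be checked by brute-force enumeration. This is a minimal sketch that assumes simple Mendelian inheritance, full siblings, and independent loci. The function name and the allele labels A, B, C, D are made up for illustration.

```python
from fractions import Fraction
from itertools import product

def sibling_match_probability(mother, father):
    """Chance that two full siblings inherit the same genotype at one locus.

    Each parent is a pair of allele labels. Each child receives one allele
    from each parent, chosen with equal probability and independently of
    its sibling.
    """
    # Four equally likely inheritance outcomes; sorting makes A/B equal B/A.
    children = [tuple(sorted((m, f))) for m, f in product(mother, father)]
    # Count the sibling pairs (out of 16) whose genotypes are identical.
    matches = sum(1 for a, b in product(children, repeat=2) if a == b)
    return Fraction(matches, len(children) ** 2)

# Both parents heterozygous, four distinct alleles: the 1 in 4 case.
print(sibling_match_probability(("A", "B"), ("C", "D")))
# One parent homozygous.
print(sibling_match_probability(("A", "A"), ("C", "D")))
# Both parents carry the same two alleles.
print(sibling_match_probability(("A", "B"), ("A", "B")))
# Five independent loci at 1 in 4 each.
print(Fraction(1, 4) ** 5)
```

With four distinct parental alleles at every locus, five loci give 1/1024, the figure the opinion quoted in comment 1 calls correct. A homozygous parent raises the per-locus chance to 1/2, and two parents with the same pair of alleles raise it to 3/8.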

  5. The L.A. SLIMES is like the horns of a steer: a point here, a point there, and a lot of bull in between.

    krazy kagu (40bcdd)

  6. Two, two, two mints in one here. ;^)

    Minor nits actually, but in the interest of coherent and consistent reasoning in these long and complicated matters, I’ll make them.

    James B. Shearer writes 5/10/2008 @ 4:35 pm:

    Can you force someone to give a DNA sample if they aren’t a suspect? I think you need probable cause for a warrant.

    Yes, a warrant will require someone to furnish a DNA sample. But DNA samples can be, and have been, obtained perfectly legally by stealth or even by consent. One such example I recall from some years ago was that an undercover officer simply picked up a cup and drinking straw that he observed a suspect first drinking from and then discarding. Or maybe he even asked for the cup without revealing his identity. In any case he obtained the sample with adequate chain of custody to use as evidence, and without a warrant.

    htom writes 5/10/2008 @ 4:47 pm:

    I’m inclined to think that “likelihood” is an inappropriate method for comparing living brothers’ DNA; “measurement” would be the correct method.

    The word “likelihood” is a term of art in statistics, as is the word “probability”. I’ve seen “likelihood” bandied about here as a synonym for “probability”. But it really isn’t. I may have made the same error at some point. It’s an easy error to make.

    Likelihood refers to the probability density function of variable parameters with the outcome fixed.

    Probability refers to the probability density function of a variable outcome with the density function’s parameters fixed.

    If that’s not clear as mud, here is the Wikipedia article.

    Confusion about the meaning of “likelihood” (erroneously using the term “likelihood” as the probability that the distribution parameter values are the right ones, given the observed sample) is one form of the prosecutor’s fallacy.

    Occasional Reader (0a66b9)

  7. It would seem to me that the real odds of guilt have to be determined by listing all the possible scenarios consistent with the data, and then comparing the original odds applicable to each. If the data can only be explained by two possible theories, each of which originally stood a 1 in a million chance of occurring, then the two would seem equally probable now.

    Xrlq (62cad4)

  8. Occasional Reader: My point was that with five people, you could measure the five values and compare. No probability calculations are needed.

    htom (412a17)

  9. This blog entry emphatically shows what is wrong with both the US legal system in general and jury trials specifically.

    There should be no way that such a subtle difference in how closing arguments (or a prosecutor’s trial statements) are stated should/could be a basis for appeal.

    An existing (and proven “appeal proof”) method for stating DNA probabilities must already exist…is there one in the Prosecutor 101 Handbook?

    Even if there isn’t a proven method for stating the odds of coincidental match of an innocent person, juries would normally judge guilt/innocence based on all evidence presented, including consideration of the extremely small chance of a purely coincidental match.

    After all, if the defense could present doubt as to whether the defendant was in the city at the time of the attack and had no history of this behavior, then the DNA evidence would carry less importance. I think any reasonable jury would consider the DNA evidence as a tool for the police to identify suspects but not the final evidence of guilt.

    Which means in the real world, to real juries, the subtle differences in statements should carry no weight in the appeals process…

    db (3c0940)

  10. 10

    In this particular case the jury was not told the hit came from a DNA database. That is not a subtle difference.

    James B. Shearer (fc887e)

  11. The discussion centers on nuances surrounding how the odds/probabilities are presented and then interpreted…in my mind, any jury composed of non-statisticians would see the nuances as being very subtle to the point of obscurity.

    I still think it is a flaw in our system if this were allowed as a basis for appeal…the DNA stats only demonstrate the chance of two people sharing the same 5 1/2 markers to be very remote indeed but it is not zero.

    In the absence of other evidence, I don’t think this case would have gone to trial…but I am a bit confused now, how did the prosecutor tell the jury the suspect was identified, considering it was a 30 year old case?

    db (09594e)

  12. The word “likelihood” is a term of art in statistics, as is the word “probability”. I’ve seen “likelihood” bandied about here as a synonym for “probability”. But it really isn’t.

    [...]

    Likelihood refers to the probability density function of variable parameters with the outcome fixed.

    Probability refers to the probability density function of a variable outcome with the density function’s parameters fixed.

    Quite correct. People who want to talk about probability and statistics should keep this point in mind lest you look foolish.

    Steve Verdon (94c667)

  13. In the absence of other evidence, I don’t think this case would have gone to trial…but I am a bit confused now, how did the prosecutor tell the jury the suspect was identified, considering it was a 30 year old case?

    IIRC, in the Puckett case the jury was not told how Mr. Puckett was found. The probability they needed to know is the probability of the database search producing a hit, irrespective of guilt or innocence. Given a random match probability of 1 in 1.1 million and 338,000 people in the database, the probability of exactly one match is 0.226. The probability of getting at least one match (possibly more) is 0.2646. These numbers were not given to the jury and that was a serious mistake.

    Steve Verdon (94c667)
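The two figures in comment 13 can be checked with a short binomial calculation. This sketch assumes that everyone in the database is unrelated to the true source and that each profile matches independently with the same probability. The function name is made up for illustration.

```python
def database_hit_probabilities(rmp, database_size):
    """Return (P(exactly one match), P(at least one match)).

    Assumes every profile in the database belongs to someone unrelated to
    the true source, and that each matches independently with chance rmp.
    """
    p_none = (1 - rmp) ** database_size
    p_exactly_one = database_size * rmp * (1 - rmp) ** (database_size - 1)
    return p_exactly_one, 1 - p_none

# Figures from the comment: 1 in 1.1 million, 338,000 profiles.
exactly_one, at_least_one = database_hit_probabilities(1 / 1_100_000, 338_000)
print(f"exactly one match: {exactly_one:.4f}")
print(f"at least one match: {at_least_one:.4f}")
```

Under those assumptions the results land close to the 0.226 and 0.2646 quoted in the comment, so a hit from an innocent-only database of that size would be expected roughly a quarter of the time.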

  14. 12

    From the LAT article:

    “Not long into their deliberations, jurors sent the judge a handwritten question: “How was [Puckett] identified as a person of interest?”

    Consistent with his earlier ruling, Benson did not tell them about the database. He replied that they should not speculate about how Puckett was identified.”

    This was extremely unfair to the defense.

    James B. Shearer (fc887e)

  15. This was extremely unfair to the defense.

    No kidding.

    So was not presenting them with the relevant information on probabilities of matches irrespective of guilt. That kind of thing is, IMO, tantamount to withholding evidence from the defense and should be grounds for an appeal.

    Steve Verdon (4c0bd6)

  16. Not only that, but the prosecutor had the chutzpah to close by asking rhetorically, “What are the odds” that the DNA would randomly match to someone who happened to be a serial rapist. Pretty damned high, actually, if you know they found the guy by fishing for suspects in a pool full of convicted sex offenders, which Merin obviously knew but the jury conveniently did not.

    Xrlq (b71926)

