Patterico's Pontifications

5/29/2008

L.A. Times Corrects the Most Trivial of Three Errors From Its Article on DNA, Statistics, and Cold Hits

Filed under: Dog Trainer,General — Patterico @ 7:06 am

Recently, I pointed out three errors in an L.A. Times article on DNA, statistics, and cold hits (see here and here). Two were substantive and one was a trivial instance of the newspaper turning a fraction upside down.

Guess which one they are correcting?

DNA evidence: A May 4 article in Section A about the statistical calculations involved in describing DNA evidence in a murder case contained an arithmetic error. It said that multiplying the probability of 1 in 1.1 million by 338,000 was the same as dividing 1.1 million by 338,000. Actually, it’s the same as dividing 338,000 by 1.1 million. The answer, a 1 in 3 probability of a coincidental match between crime scene DNA and genetic profiles in a state database, was correct.

Yes, that is the trivial error.
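For anyone who wants to verify the corrected arithmetic, here is a quick sketch in Python, using the numbers from the article:

```python
# Verify the correction: multiplying the 1-in-1.1-million match probability
# by the 338,000 profiles searched is the same as dividing 338,000 by 1.1 million.
p_random_match = 1 / 1_100_000   # random-match probability at 5.5 loci
database_size = 338_000          # profiles in the state database

product = p_random_match * database_size   # what the calculation actually does
quotient = database_size / 1_100_000       # the corrected phrasing

assert abs(product - quotient) < 1e-12     # same number either way
print(round(product, 2))                   # ~0.31, i.e. roughly "1 in 3"
```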

Congratulations to Xrlq’s Aunt Ruth for noting it and bringing it to my attention.

But I am very, very disappointed that the paper is leaving two far more substantive and significant errors uncorrected. To recap, here was the first error:

Jurors were not told, however, the statistic that leading scientists consider the most significant: the probability that the database search had hit upon an innocent person.

In Puckett’s case, it was 1 in 3.

The reporter tells me:

In our story, we did not write that there was a 1 in 3 chance that Puckett was innocent, which would be a clear example of the prosecutor’s fallacy. Rather, we wrote: “Jurors were not told, however, the statistic that leading scientists consider the most significant: the probability that the database search had hit upon an innocent person. In Puckett’s case, it was 1 in 3.” The difference is subtle, but real.

(My emphasis.)

I fail to see any difference whatsoever.

The key fact: the hit to Puckett was the only hit that occurred. So when the article says there was a 1 in 3 chance that the search “had hit” on an innocent person, it is describing the chance that the hit to Puckett was a hit to an innocent person.

This is indeed the same as saying that there was a 1 in 3 chance that Puckett was innocent — which the reporter admits is inaccurate.

I believe the article meant to say this: if the database had consisted only of innocent people, there was a 1 in 3 chance that the search would hit on an innocent person. Phrased that way, the statement would be accurate, and would shed light on the question of how surprised we should be by a database hit.
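That reading can be checked directly. A minimal sketch, assuming (hypothetically) a database of 338,000 innocent profiles, each matching independently with probability 1 in 1.1 million: the article's "1 in 3" is the expected number of coincidental matches, which is close to, though not identical to, the probability of at least one coincidental hit.

```python
# Hypothetical: a database of innocents, each profile matching by chance
# with probability 1 in 1.1 million, independently.
p = 1 / 1_100_000
n = 338_000

expected_matches = n * p             # ~0.307: the article's "1 in 3" figure
p_at_least_one = 1 - (1 - p) ** n    # ~0.265: chance of one or more coincidental hits

print(expected_matches, p_at_least_one)
```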

But that’s not what the paper said. Instead, the article indicated the odds that the search “had hit” on an innocent person — in other words, the odds that Puckett himself was innocent.

By the way, the reporter indicated in an e-mail to me that he believes commenter Xrlq agrees with him on this point. He should read this post, in which Xrlq says that the 1 in 3 number as expressed by the paper is “almost certainly wrong.”

The second error was this passage:

Because the match in Puckett’s case involved only 5 1/2 genetic locations, the chance it was coincidental was higher but still remote: 1 in 1.1 million.

This is a classic example of the “prosecutor’s fallacy.” [UPDATE: I have a professor telling me it's more accurately called the "transposition fallacy".] The paper took a number meant to express the generalized odds of an event occurring, and used it to express the chance that a particular occurrence was a coincidence.

If there’s a 1 in 100 million chance of winning the lottery, you can’t say of the winner: “the chance that his win was coincidental was 1 in 100 million.”

That makes it sound like he was certain to win. But until he did win, he was almost certain not to.
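The distinction can be made concrete with a sketch, assuming a hypothetical round number of 100 million independent tickets, each with a 1-in-100-million chance: any particular ticket's win was nearly impossible beforehand, yet it is quite likely that somebody wins.

```python
p_win = 1 / 100_000_000   # chance any one ticket wins
tickets = 100_000_000     # tickets in play (hypothetical round number)

# For a specific ticket-holder, winning was almost impossible beforehand...
p_this_ticket = p_win

# ...but the chance that *someone* wins is high.
p_somebody_wins = 1 - (1 - p_win) ** tickets
print(round(p_somebody_wins, 2))  # ~0.63
```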

Again, I have a good idea what the paper meant to say. But it’s not what they actually said.

The issue here is not the math. It’s about the proper way to express the math in English. It’s a tricky thing to do, but the fact that it’s tricky doesn’t excuse a failure to correct misleading language.

14 Responses to “L.A. Times Corrects the Most Trivial of Three Errors From Its Article on DNA, Statistics, and Cold Hits”

  1. If the math is wrong, it doesn’t matter how you express it. And nearly every mathematical construction in both the Times and in the suggested correction(s) is wrong. Not wrong in expression, but wrong in concept.

    Not that it matters as the REAL probability is uncomputable: what is the chance that the perp was in the database? In the end, nothing else really matters.

    Kevin Murphy (0b2493)

Good for you, Patterico, for bird-dogging this down until the Times reporter reluctantly whispered, “auntie.” Shouting “Uncle” is something that the LAT will never do. Even after the doors close.

    C. Norris (abe9e9)

  3. That was a very well constructed correction in a nicely written explanation of theoretical probability concepts! Bravo!

    Jack Klompus (cf3660)

Because comments are closed for the No on 98/99 thread (why, I might ask?), I have to post in this thread.

Rent control is evil, and its doom, along with the restrictions on Kelo-type takings, is exactly why you should vote Yes on 98.

    gabriel (6d7447)

  5. Justin has an annoying habit of closing comments for all his posts. It must be nice having the last word.

    Xrlq (b71926)

  6. It could be he doesn’t want his posts to turn into the kind of toxic environment we’ve seen in some other recent threads.

    aphrael (e0cdc9)

  7. Perhaps so, but I find that approach a bit like curing dandruff through decapitation.

    Xrlq (b71926)

  8. Or perhaps his position is undefended and he knows it.

    gabriel (6d7447)

  9. This series is an example of the kind of posts that keep me coming back to your blog despite the fact that your political stuff sometimes really pisses me off.

    Phil (0ef625)

  10. Not that it matters as the REAL probability is uncomputable: what is the chance that the perp was in the database? In the end, nothing else really matters.

Uhhmmm, not necessarily. You don’t have to have this probability. Ultimately, we are interested in

P(G|M=1) = P(M=1|G) * P(G) / P(M=1).

    The fact that Puckett was a match does make it more likely that he was indeed the attacker. However, given how weak that evidence was, I don’t think it is reasonable to conclude that this evidence is sufficient.

Of course, if we knew that the killer was in the database, then the problem would be trivial. But we don’t know that, we can’t really know that, and hence we have to rely on the logic of probabilistic reasoning to try to assess what this kind of evidence is worth. In this specific case, it doesn’t strike me as very good.

    Steve Verdon (94c667)

Uhhmmm, not necessarily.

    Yes, necessarily. One person matched. Only one. That person was either a random match, or a match to the actual killer. Without knowing how likely each of those two possibilities was to occur, there is no way of knowing which is more likely to have happened in this case, or by how much. Prior to the database search, the odds of a random match were 1 in 3. After the fact, they’re 1 in splunge.

    Xrlq (b71926)

  12. Yes, necessarily. One person matched. Only one. That person was either a random match, or a match to the actual killer.

No, it is not necessary to know whether or not the killer is in the database to utilize that evidence, as was seen from the equation in my previous post (#10). In fact, I’d argue that going down that road is extremely complex, and since it isn’t necessary, why go that way?

Basically, the problem is calculating

    P(G)/P(M=1).

P(G) is the prior probability of guilt before observing the evidence. If that probability is less than 0.22, then the probability of guilt given the evidence (i.e., one match) is less than 1. If the prior probability of guilt is, say, 0.11, then the probability of guilt given the evidence is 0.5. If you take as your prior the probability of 1/338,000, then the probability of guilt given one match is 0.000013448, which looks ridiculously small, but considering that it is about 4.5 times the initial prior, it is a huge increase in the probability that Puckett is guilty. Then, using the 0.000013448 number, we can update again with additional evidence, such as Puckett being in the SF area around the time of the murder.

The bottom line is that a match does mean that the probability that Puckett is the killer is higher. Does it raise that probability beyond a reasonable doubt? I guess that depends on what you think is a reasonable prior probability of guilt, taking into consideration our tradition of assuming a person is innocent until proven guilty.

    Steve Verdon (4c0bd6)
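[The arithmetic in comment #12 can be reproduced in a few lines. This is only a check of the commenter's numbers, which implicitly treat the ratio P(M=1|G)/P(M=1) as a fixed constant of about 4.5 (implied by the 0.11 → 0.5 prior/posterior pair), not an endorsement of that modeling choice. — Ed.]

```python
# Implied Bayes factor from the comment's 0.11 -> 0.5 prior/posterior pair.
ratio = 0.5 / 0.11   # P(M=1|G) / P(M=1), treated as fixed, ~4.545

# Prior at which the posterior reaches certainty (the comment's 0.22 threshold).
print(round(0.22 * ratio, 6))           # ~1.0

# Posterior starting from a 1/338,000 prior (the comment's 0.000013448).
print(round((1 / 338_000) * ratio, 9))  # ~1.3448e-05
```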

  13. Your equation is meaningless. One of two things happened. Either the database matched one person randomly, or the true killer was in the database and it returned a match to him. Either you know how likely each of those two possibilities was, or you don’t. If you don’t know anything about one, knowing everything there is to know about the other won’t help you. It’s like asking if 1,000,000/n is a big number or a small one. Without knowing the value of n, any answer to that question is nothing more than a SWAG.

    Xrlq (62cad4)

  14. Your equation is meaningless.

No, it is called Bayes’ theorem (or Bayes’ rule), and it is how you’d try to answer the pertinent question: how much weight should the DNA match carry toward the guilt, or innocence, of Puckett?

    One of two things happened. Either the database matched one person randomly, or the true killer was in the database and it returned a match to him.

Yes, but we can’t tell which. So we have to weigh the value of the evidence probabilistically, and that means we trot out Bayes’ theorem. You’d do the same thing for, say, a positive drug test, a positive test for cancer, etc., if there were a chance of false positives. You can do the same here.

    It’s like asking if 1,000,000/n is a big number or a small one. Without knowing the value of n, any answer to that question is nothing more than a SWAG.

There is no wild-ass guessing here. We know that once P(G) is 0.22 or higher, then Puckett is guilty. Of course, P(G) = 0.22 is pretty damned high to begin with and hence is probably too high an initial prior.

Using Bayes’ theorem in this way for other problems like this is well established. Consider the following problem:

    1% of women at age forty who participate in routine screening have breast cancer. 80% of women with breast cancer will get positive mammographies. 9.6% of women without breast cancer will also get positive mammographies. A woman in this age group had a positive mammography in a routine screening. What is the probability that she actually has breast cancer?

Let P(B) = the 1% of women who have breast cancer. This is our prior, and it is the analog to P(G) for Puckett.

Let P(Pos|B) = the 80% of women with breast cancer who get a positive mammogram. This is our analog to P(M=1|G).

Let P(Pos|~B) = the 9.6% of women who don’t have breast cancer but get false positives. This doesn’t have a direct analog in the formulation I’ve used.

Now a woman you know gets a positive mammogram; how worried should you be? The probability you want, then, is

    P(B|Pos) = P(Pos|B)*P(B)/P(Pos)

We don’t know P(Pos), so we’ll have to calculate it:

    P(Pos) = P(Pos|B)*P(B) + P(Pos|~B)*P(~B).

    The above follows from the theorem of total probability. We know that P(~B) = 1 – P(B) or 0.99. Further that P(Pos|B) = 0.8 and that P(Pos|~B) = 0.096, thus P(Pos) = 0.10304.

    P(Pos) by the way, is the analog to P(M=1) in the case of Puckett.

The numerator is 0.008. Thus the probability that this woman you know actually has cancer is 0.078, or 7.8%. Some cause for concern, but before you panic, do some additional tests to verify that cancer is present. Surprisingly, many doctors screw this up and conflate P(Pos|B) with P(B|Pos), which is an error slightly in excess of an order of magnitude.

In the case of Puckett, the crucial part is what the initial P(G) should be. I’d say that initially it should be 1/338,000, since that is the number of profiles they ran the sample against. Then you take the revised prior and do similar calculations with the remaining data that has been observed. If the police had removed people from the database who couldn’t possibly have done the crime (i.e., weren’t born, were incarcerated at the time of the crime, etc.), then you could go with a higher prior (and a lower probability of false positives). But at the time of running the DNA through the database, the belief was that any one of them could have been the rapist/killer, hence the prior of 1/338,000.

Police should keep this kind of thing in mind when running DNA through a database. Reduce the number of people in the database wherever possible. If a person in the database wasn’t born yet, was incarcerated, was known to be on the other side of the country, etc., remove them to increase the prior probability of guilt and also reduce the probability of a false positive.

    Steve Verdon (4c0bd6)
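[The mammogram numbers in comment #14 check out; here is a short Python sketch of the same Bayes calculation. — Ed.]

```python
# Bayes' theorem applied to the mammogram example from the comment.
p_b = 0.01              # P(B): prior probability of breast cancer
p_pos_given_b = 0.80    # P(Pos|B): true-positive rate
p_pos_given_nb = 0.096  # P(Pos|~B): false-positive rate

# Theorem of total probability: P(Pos) = P(Pos|B)P(B) + P(Pos|~B)P(~B)
p_pos = p_pos_given_b * p_b + p_pos_given_nb * (1 - p_b)   # ~0.10304

# Bayes: P(B|Pos) = P(Pos|B)P(B) / P(Pos)
p_b_given_pos = p_pos_given_b * p_b / p_pos                # ~0.0776, i.e. ~7.8%

print(round(p_pos, 5), round(p_b_given_pos, 3))
```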

