Patterico's Pontifications

4/18/2009

Ed Humes on DNA, Probability, and Cold Hits

Filed under: General — Patterico @ 4:58 pm



In the latest edition of California Lawyer, Ed Humes has an article on that DNA controversy that the L.A. Times stirred up a few months back.

It’s a continuation of the L.A. Times distortions on this topic. Although I think Humes understands the issues better, I don’t think he makes them any clearer for the casual reader.

The article revolves around the case of John Puckett, a San Francisco man who was charged with murder due to a cold hit from a DNA database. This is the same man about whom the L.A. Times had written a misleading article, which expert David Kaye said had “mischaracterized” a key probability.

Humes is a clever writer who has a way with words. I have read and admired nearly all of his books — but I also recognize that he a) is an advocate for criminal defendants, and b) knows how to phrase damaging facts to downplay them. For example, when he tells us that criminal defendant Puckett has served prison terms for multiple rapes, Humes introduces the damaging facts with this sentence: “Still, Puckett was no Boy Scout.”

Humes writes:

During Puckett’s San Francisco trial last January, the prosecution’s expert estimated that the chances of a coincidental match between the defendant’s DNA and the biological evidence found at the crime scene were 1 in 1.1 million. This, no doubt, gave the jurors a compelling reason to convict Puckett of the killing and send him to prison for at least seven years—if not the rest of his life.

What the jurors didn’t know, though, and what the judge didn’t think they needed to know, is that there’s another way to run the numbers. And according to that math, the odds of a coincidental match in Puckett’s case are a whopping 1 in 3.

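This phrasing seems to imply that the same coincidental match could have odds of 1 in a million or 1 in 3. This is the same error that the L.A. Times made in its article on Puckett’s case. The two numbers measure completely different things: one is the chance that a given unrelated person matches the crime-scene profile, and the other is the chance that a search of the whole database turns up at least one coincidental match.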

The thing is, Humes understands this stuff better than the L.A. Times’s Maura Dolan and Jason Felch did. And in a very clever (and slippery) move, Humes goes on to explain that both numbers are correct, and then uses the example of the “birthday problem” to illustrate why the two sets of odds are so different. This is misleading in two ways.

1) First, Humes never tells readers, even once, that the “1 in 3” statistic makes sense only if jurors are told that the hit came from a database, because the only way to explain the meaning of the 1 in 3 number is by reference to a database. But generally, jurors are not told that the hit came from a database, in order to protect the defendant from being damaged by a jury’s inference that his DNA was in a database because of his criminal record.

If the jury isn’t told about the database, how can you make sense of the 1 in 3 number?

(Interested readers can read an analogy here.)

2) Second, Humes’s use of the birthday problem greatly magnifies the readers’ perception of the likelihood of a match, because the classic birthday problem doesn’t ask the likelihood that you will match someone in the room, but the likelihood that someone in the room will match someone else.

The chance that you will match someone in a room of 23 people is much smaller than the chance that there is a pair in the room that shares a birthday.
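
For readers who want to check the gap between those two questions, here is a minimal Python sketch. It assumes 365 equally likely birthdays and ignores leap years; the function names are mine, for illustration only.

```python
from math import prod

DAYS = 365  # assumption: 365 equally likely birthdays, no leap years


def p_you_match(room_size: int) -> float:
    """Chance that at least one of the OTHER people shares your birthday."""
    others = room_size - 1
    return 1 - ((DAYS - 1) / DAYS) ** others


def p_any_pair(room_size: int) -> float:
    """Chance that at least one pair in the room shares a birthday."""
    p_all_distinct = prod((DAYS - k) / DAYS for k in range(room_size))
    return 1 - p_all_distinct


print(f"you match someone in a room of 23: {p_you_match(23):.3f}")
print(f"some pair matches in a room of 23: {p_any_pair(23):.3f}")
```

Under those assumptions, the first figure comes out near 6 percent and the second just over 50 percent.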

The bottom line is that using the birthday problem greatly exaggerates the effect of the database.

In my view, the birthday problem concept has little relevance to a trial like Puckett’s, where there is one man on trial who matches. Its only possible relevance is to the question of whether the statistics typically given are undermined by recent data runs that defense attorneys conducted on an Arizona database. The answer, misleading stories in the L.A. Times notwithstanding, is no.

David Kaye has promised a post on the Humes article on his blog. I will link it when it’s available.

UPDATE: Interestingly, Professor Kaye’s post and mine went up at almost the same time (a coincidence, I assure you). His post is here. I am on my way out the door and haven’t read it yet, but I will comment on it more later. I will note, however, that the first line confirms my own view: “This month’s issue of the California Lawyer perpetuates the confusion in the media about DNA database trawls.”

Indeed. Of course, the confusion all began thanks to the L.A. Times.

23 Responses to “Ed Humes on DNA, Probability, and Cold Hits”

  1. Interestingly, the NY Times suggests that these DNA databases are growing fast. Soon these hits will be out of 50 million records, not 1 million.

    Not that that means the odds of guilt are 50 million to 1 — there still needs to be substantial corroborating evidence. Given that, however, the DNA match is damning as all hell.

    Kevin Murphy (0b2493)

  2. What we desperately need here is the statistical genius of Cyrus Sanai to unravel these conundrums.

    daleyrocks (5d22c0)

  3. I think the number that would be most interesting to the jury is, given the degree of the matching done and the size of the database being searched against, how many members of the jury would be identified as “matching” at least one entry in the database.

    The way I interpret the math, the chance of a match between any random person and an entry in the database is one in 1.1 million. If each entry in the database represents an additional chance for a match, then the odds that a random person will “match” at least one of the entries is about one in three.

    That would seem to imply that if the members of the jury were all tested with the same degree of rigor against that database, (comparing fewer than the full number of alleles needed for a full match) four of them would “match”. Indeed, if the defense attorney had simply run the DNA of a dozen randomly selected people against that database, and counted the number of eight allele matches against the database, that might have seriously shaken the jury’s faith in the defendant’s “match”.

    Karl Lembke (3f9ab0)

  4. Sigh. If I were the judge, I would simply not allow the numbers in. At a 5.6 match, their potential for prejudice far outweighs their probative level.

    nk (0214d0)

  5. Patterico,

    You are wrong on the relevance question again–but you did a very nice job of making it obvious HOW you are wrong.

    If the evidence is corroborative, then the correct statistic is 1 in a million. If the evidence is identifying, then the correct statistic is 1 in 3. (I’ve said this before, and you just can’t seem to get your head around it).

    I will explain one more time what the distinction is. If you already have the accused identified as the rapist by other good evidence (eyewitness, fingerprints, whatever), then the question is, “What are the odds that this guy here, who all the other evidence points to as the rapist, is in fact innocent but shares the DNA of the actual rapist?” The answer, of course, is pretty unlikely, one in millions. This is the CORROBORATIVE case; you have other evidence pointing to the guy, so the question is not whether there are any two matches, but whether THIS GUY could be a match.

    However, if the ONLY meaningful evidence is the DNA, then the correct odds of misidentification are 1 in 3. Why? Because in the identifying case, there is a match, and then law enforcement picks up THAT GUY, whoever he is, and arrests him. In that case, the question is what are the odds that there are at least two matches? You don’t ask the question “what are the odds that this guy matches someone else” because the DNA evidence is already used to pick him out in the first place.

    Cyrus Sanai (ada6da)

  6. Like a bad penny…

    AD (65649f)

  7. He does not share the DNA of the actual rapist any more than three hundred and thirty-three other people, maybe you among them, do, Cyrus.

    nk (0214d0)

  8. And that’s only according to a different database which may be representative of the actual population or may not be.

    nk (0214d0)

  9. I’m not doing the math on this again.

    Y’all can look up the old threads if you want to see my reasoning.

    Bayes’ Theorem provides the proper mode of analysis. Mr. Sanai’s views on the statistics are unorthodox and lack credibility.

    If the evidence is identifying, that does NOT mean it’s 1-in-3. For example, if we DNA tested two million Chinese people born after 1980, and we found one who matched the DNA profile of the murderer, the chance of guilt is not 1 in 3. It’s exactly zero.

    Likewise, more than three people in the U.S. match the killer’s DNA profile. They can’t all have a 1-in-3 chance of committing the crime.

    Bayes’ Theorem teaches us that learning more information cannot, by itself, provide the exact probability unless the information is conclusive, such that there is a 100% chance or 0% chance.

    Bayes’ Theorem teaches us that learning new information that is not conclusive can only modify our prior views. We take the new information (the results of the DNA test), and use that to modify our old information (the likelihood that we thought, prior to the DNA test, Mr. Puckett was guilty of the crime).
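
    As a rough sketch of that updating step, here is the odds form of Bayes’ Theorem in Python. The prior probabilities are made-up placeholders, not estimates for Mr. Puckett’s case, and the likelihood ratio assumes the true source always matches while an unrelated innocent person matches with probability 1 in 1.1 million. The function name is illustrative.

```python
def posterior_probability(prior: float, likelihood_ratio: float) -> float:
    """Update a prior probability of guilt using the odds form of Bayes' Theorem."""
    if prior <= 0.0:
        return 0.0  # a conclusive prior of zero stays zero, whatever the DNA says
    if prior >= 1.0:
        return 1.0
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)


RANDOM_MATCH_PROB = 1 / 1.1e6
# Assumption: P(match | source) = 1, so the likelihood ratio is 1 / P(match | not source).
LIKELIHOOD_RATIO = 1 / RANDOM_MATCH_PROB

# Hypothetical priors, chosen only to show how much the answer depends on them.
for prior in (0.0, 1e-7, 1e-6, 1e-4):
    post = posterior_probability(prior, LIKELIHOOD_RATIO)
    print(f"prior {prior:.0e} -> posterior {post:.3f}")
```

    The same DNA result yields very different posteriors depending on the prior, and a prior of exactly zero (the people born after 1980) stays at zero.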

    There is no other proper way to perform this analysis.

    Daryl Herbert (b65640)

  10. The way I interpret the math, the chance of a match between any random person and an entry in the database is one in 1.1 million.

    No, no, no, no, no.

    You’re falling for the LAT/Ed Humes oversimplification.

    The chance of a match between any random person WITH A FULL NONDEGRADED DNA SAMPLE and an entry in the database is going to be one in several trillion or quadrillion or something like that.

    Patterico (cc3b34)

  11. And I encourage anyone who wants to really understand this to ignore Cyrus Sanai, and read the David Kaye piece.

    Patterico (cc3b34)

  12. The only relevant statistic is what number of residents of the US (or whatever) would match the crime scene sample to the same degree. In this case I would presume the answer is “hundreds.”
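
    To put a rough number on that, the expected count is just population times match probability. In the Python sketch below, the 300 million population is a round-number assumption of mine, the 1 in 1.1 million figure is the one quoted in the post, and the calculation treats everyone as unrelated to the source.

```python
RANDOM_MATCH_PROB = 1 / 1.1e6          # figure quoted in the post
ASSUMED_POPULATION = 300_000_000       # assumption: round number, not from the post


def expected_matches(population: int, match_prob: float) -> float:
    """Expected number of people in the population who would match by chance."""
    return population * match_prob


print(f"expected coincidental matches: {expected_matches(ASSUMED_POPULATION, RANDOM_MATCH_PROB):.0f}")
```

    With those inputs the expected count is a bit over 270, which fits “hundreds.”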

    Is that enough, with no other evidence, to convict someone? No. Is it enough if, say, his car got a parking ticket nearby at the same time? Guilty, guilty, guilty.

    But the rest of these pseudo-statistics are just boring.

    Kevin Murphy (0b2493)

  13. The way I interpret the math, the chance of a match between any random person and an entry in the database is one in 1.1 million.

    No, no, no, no, no.

    You’re falling for the LAT/Ed Humes oversimplification.

    No, I simply thought we were referring to the specifics that apply to the Puckett case. If I have to explicitly refer to them, I suppose I can go back and copy the list for pasting into each post. I didn’t think you wanted that.

    However, given degradation of the DNA sample such that only 5½ of the possible 13 markers are available, the way I interpret the math, the chance of a match between any random person and an entry in the database is one in 1.1 million.

    But there are 338,000 entries in the database.

    Each entry represents one chance of hitting that one in 1.1 million jackpot. The odds of missing all of them are (1 – 1/1.1 million)^338,000 ≈ 73.5%

    That means the odds of matching at least one entry in the database are 100 – 73.5 = 26.5%.

    This would be the case for a random person walking down the street, if you took a sample of his or her DNA, degraded away 7½ of the markers, and ran the database trawl with what was left.

    Similarly, if you took a DNA sample from each of 12 jurors, degraded away 7½ of the markers, and ran the database trawl with what was left, you’d expect to find an average of about 3.2 matches.
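
    For anyone who wants to rerun this arithmetic, here is a minimal Python sketch. It assumes each database entry is an independent 1 in 1.1 million chance, uses the 338,000 figure above, and takes a 12-person jury; the function name is just for illustration.

```python
import math

RANDOM_MATCH_PROB = 1 / 1.1e6   # per-entry chance of a coincidental match
DATABASE_SIZE = 338_000         # entries in the database
JURORS = 12                     # assumption: a 12-person jury


def p_at_least_one_match(match_prob: float, entries: int) -> float:
    """Chance that one person's profile coincidentally matches at least one entry."""
    # (1 - p) ** n, computed through logs to stay accurate for tiny p
    p_miss_all = math.exp(entries * math.log1p(-match_prob))
    return 1 - p_miss_all


p_hit = p_at_least_one_match(RANDOM_MATCH_PROB, DATABASE_SIZE)
print(f"miss every entry:        {1 - p_hit:.3f}")
print(f"match at least one:      {p_hit:.3f}")
print(f"expected jurors matched: {JURORS * p_hit:.2f}")
```

    The shortcut of multiplying 338,000 by 1/1.1 million gives about 0.31, the expected number of coincidental matches, which rounds to the “1 in 3” figure; the at-least-one probability is slightly lower.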

    If a DNA match of this quality were the only piece of evidence against a defendant, I’d have to conclude there’s plenty of reasonable doubt.

    Karl Lembke (3f9ab0)

  14. I seem never to hear about people who were eliminated from suspicion because of testing. Prior to testing we have suspects A, B, and C. DNA from the crime scene is run against the database, totally eliminating suspects A and C. Suspect B is still in the running, but he has an airtight alibi (having been on live television, watched by millions)… We then find that it was Suspect B’s Butler (who, we find out, is Suspect B’s older half brother, sired by Suspect B’s father – who never knew he had sired a child prior to Suspect B. – The Butler’s mother never told Suspect B’s father that she was even pregnant, let alone given birth to his firstborn son.)

    I can be contacted for the rights to this story.

    kimsch (2ce939)

  15. First, Humes never tells readers, even once, that the “1 in 3” statistic makes sense only if jurors are told that the hit came from a database, because the only way to explain the meaning of the 1 in 3 number is by reference to a database.

    I disagree. The relevance of the 1 in 3 figure derives from the fact that the defendant was selected from a database trawl, not from whether or not the jury was told about that crucial fact. Unless the goal is to keep one’s lies consistent, excluding the true fact of the 1 in 3 figure solely because the judge also excluded the true fact of the database makes no sense whatsoever.

    Of course, as we’ve previously discussed, leading a jury to think that the odds of a single match going wrong are 1 in 3 would be pretty screwy, too. So the best solution would either be to tell the jury about the database, or to caution them generally that for all they know, Big Brother could have pulled this out of a secret database of the entire world’s population, ergo, nothing less than a 1 in 10 billion probability should ever, by itself, be construed even as a preponderance of the evidence, let alone proof beyond reasonable doubt that the individual on trial is actually the perp, rather than just one member of a small group of potential perps, all but one of whom are in fact innocent.

    But generally, jurors are not told that the hit came from a database – to protect the defendant from being damaged by a jury’s inference that his DNA was in a database because of his criminal record.

    That might make sense if the offenses themselves were inadmissible at trial, but once the underlying offenses come in, as they did in Puckett’s case (and routinely do in California when the past offenses are sex offenses), excluding evidence of the database itself protects nothing – except, of course, a prosecutor’s ability to mislead a jury into believing the defendant was chosen by a process that stood a 1 in 1.1 million chance of getting the wrong guy rather than 1 in 3.

    I agree that the birthday problem is a non-issue. We should be concerned about the risk of any innocent person randomly matching to one specific individual – the perp – and not about the possibility that anyone under the sun might randomly match to anyone else.

    Xrlq (62cad4)

  16. xrlq–

    There are no computable odds about “getting the wrong guy” as we have utterly no clue about the probability of the perp being in the database.

    As you point out, he is “just one member of a small group of potential perps, all but one of whom are in fact innocent.” Those odds we can compute, and we can then decide if the DNA evidence adds anything to other evidence we might have.

    Kevin Murphy (0b2493)

  17. Kevin, while it’s true that computing the odds of getting a match to the wrong guy and no one else requires one to know the odds of the perp being in the database, computing the odds of getting the wrong guy, period, does not. No individual, the perp included, can account for more than one record in a sea of 338,000. That means that if the perp is not in the database, we have records from 338,000 innocents, each of which stands a 1 in 1.1 million chance of randomly matching to the perp. If the perp is in the database, we have “only” 337,999 such records, each of which still stands a 1 in 1.1 million chance of randomly matching to the perp. The difference between the two is negligible; either way, we stand roughly a 1 in 3 chance of randomly matching to someone who isn’t the perp.
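
    A quick Python sketch of why the difference is negligible, under the same assumption that each innocent record is an independent 1 in 1.1 million chance (the function name is hypothetical):

```python
import math

RANDOM_MATCH_PROB = 1 / 1.1e6


def p_innocent_hit(innocent_records: int, match_prob: float = RANDOM_MATCH_PROB) -> float:
    """Chance that at least one innocent record coincidentally matches the perp."""
    return 1 - math.exp(innocent_records * math.log1p(-match_prob))


perp_absent = p_innocent_hit(338_000)    # perp not in the database
perp_present = p_innocent_hit(337_999)   # perp accounts for one of the records

print(f"perp absent:  {perp_absent:.7f}")
print(f"perp present: {perp_present:.7f}")
print(f"difference:   {perp_absent - perp_present:.2e}")
```

    The two probabilities differ by less than one in a million.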

    Xrlq (62cad4)

  18. X,

    Indeed, but there is a great danger in the jury concluding that the argument is: “a 1 in 3 chance that the search DID hit on an innocent person.”

    Too many smart people have too much trouble with the distinction. And, contrary to your argument, there’s no way to communicate the concept to the jury if they don’t know about the database.

    It’s like telling people that they really have a 1 in 3 chance that any random person will share their birthday. “Gee, I thought it was 1 in 365, but I don’t understand statistics, so if you say so . . .”

    Patterico (cc3b34)

  19. It’s all well and good to argue that the 1 in 3 figure should not have been admitted since juries are too dumb to distinguish anterior from posterior odds, but if that’s the objection, what on earth justification can there be for admitting the 1 in 1.1 million figure, either? Both figures represent the original odds that a particular event would happen; neither tells you anything about the posterior odds that the event in question did. The only difference between them is that one represents the anterior odds associated with the process that actually occurred, while the other represents the anterior odds associated with a process the jury had been misled into believing had occurred.

    The birthday analogy doesn’t work because we’re not talking about randomly selected individuals. Puckett wasn’t chosen at random; he was fished out of the database specifically because his record yielded a match. To make the birthday equivalent work, start with 20 randomly selected people, or whatever other number one needs to end up with 1 in 3 odds that someone’s birthday will match to someone else’s. Then find some way of surreptitiously learning all 20 birthdays, and then have someone identify two individuals from the list. If any two birthdays among the original 20 match, he will select two individuals whose birthdays match. If none do, he’ll instead select two individuals at random. He then gives you the names, without telling you whether they matched or not. Now, you address those two individuals. Since they weren’t told about the selection process, should you lie and tell them that there is only a 1 in 365 chance that their birthdays match? Or should you tell them the truth and say that the chances are 1 in 3?
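
    For what it’s worth, the group size needed for this setup can be computed directly. The Python sketch below assumes 365 equally likely birthdays; the helper names are illustrative.

```python
from math import prod

DAYS = 365  # assumption: 365 equally likely birthdays, no leap years


def p_shared_birthday(group_size: int) -> float:
    """Chance that at least two people in the group share a birthday."""
    return 1 - prod((DAYS - k) / DAYS for k in range(group_size))


def smallest_group(target: float) -> int:
    """Smallest group size whose shared-birthday chance reaches the target."""
    if not 0 < target <= 1:
        raise ValueError("target must be in (0, 1]")
    n = 1
    while p_shared_birthday(n) < target:
        n += 1
    return n


print(f"group size for 1 in 3 odds: {smallest_group(1 / 3)}")
print(f"chance with 20 people:      {p_shared_birthday(20):.3f}")
```

    Under that assumption the threshold is 18 people; with 20, the chance of a shared birthday is closer to 41 percent.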

    Xrlq (62cad4)

  20. The birthday analogy doesn’t work because we’re not talking about randomly selected individuals.

    Actually, it doesn’t work for the reason stated by Prof. Kaye:

    Finding that this particular profile matches at least one in the database is much less likely than finding at least one match between all pairs of profiles in the database. The latter event is the kind that is at issue in the birthday problem. See David H. Kaye, DNA Database Woes: What Is the FBI Afraid Of? (under review). It is not involved in a cold hit to a crime-scene profile.

    Your proposed analogy doesn’t work. In your analogy, the proposed odds apply to the pair that was picked. In Puckett’s case, the 1 in 3 odds do not apply to Puckett. Thinking that they did was the primary mistake the L.A. Times made. The same is arguably true of your phrasing: “The only difference between them is that one represents the anterior odds associated with the process that actually occurred, while the other represents the anterior odds associated with a process the jury had been misled into believing had occurred.”

    Patterico (cc3b34)

  21. As anterior odds go, the 1 in 3 odds certainly do apply to Puckett, for the same reason they apply to the birthday pair: he was chosen not at random, but by a process that had a 1 in 3 chance of yielding a particular result. This is not the classic birthday problem, where we ask how likely anyone is to match to anyone else. It’s more like the common sense observation that if you are in a room with 22 other people, the odds are now roughly 22 in 365, and not 1 in 365, that someone in the room will have the same birthday as you.

    All this begs the question of whether it is appropriate to introduce any anterior odds to a jury without providing them with a reliable mechanism for deriving the posterior odds they actually care about (which we can’t do since we know how likely a random match was to occur, but have no clue how likely a true match was). If your answer to that question is yes, the 1 in 3 figure should have been admitted. If it is no, neither figure should have been.

    Xrlq (62cad4)

