Patterico's Pontifications

5/6/2008

Volokh on DNA and Cold Hits

Filed under: Crime,Dog Trainer,General — Patterico @ 7:05 am

Eugene Volokh has deftly isolated the major flaw in the recent L.A. Times article on DNA, cold cases, and statistics.

In my original post I quoted the language from the article that most disturbed me:

At Puckett’s trial earlier this year, the prosecutor told the jury that the chance of such a coincidence was 1 in 1.1 million.

Jurors were not told, however, the statistic that leading scientists consider the most significant: the probability that the database search had hit upon an innocent person.

In Puckett’s case, it was 1 in 3.

. . . .

In every cold hit case, the panels advised, police and prosecutors should multiply the Random Match Probability (1 in 1.1 million in Puckett’s case) by the number of profiles in the database (338,000). That’s the same as dividing 1.1 million by 338,000.

For Puckett, the result was dramatic: a 1-in-3 chance that the search would link an innocent person to the crime.
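The arithmetic behind that 1-in-3 figure is easy to check. Here is a quick back-of-the-envelope sketch in Python (mine, not the Times's method); note that the exact chance of at least one coincidental hit is slightly lower than the simple multiplication the panels describe:

```python
# Checking the article's arithmetic (a sketch, not the Times's method).
rmp = 1 / 1_100_000   # random match probability quoted at trial
db = 338_000          # profiles searched in the offender database

naive = rmp * db                 # the article's "1 in 3" (about 0.31)
exact = 1 - (1 - rmp) ** db      # chance of at least one coincidental hit
                                 # if nobody in the database is the true
                                 # source (about 0.26)
print(naive, exact)
```

The multiplication is an approximation; the exact "at least one hit from an all-innocent database" figure is closer to 26% than 31%.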

In my original post I said:

It seems to me that the conclusion does not logically follow at all. The formulation simply can’t be right. The suggestion appears to be that the larger the database, the greater the chance is that the hit you receive will be a hit to an innocent person. I think that the larger the database, the greater the probability of getting a hit. Then, once you have the hit, the question becomes: how likely is it that the hit is just a coincidence?

Volokh explains the ridiculous nature of the L.A. Times’s formulation with an excellent example:

Here’s one way of seeing this: Let’s say that the prosecution comes up with a vast amount of other evidence against Pickett — he admitted the crime in a letter to a friend; items left at the murder site are eventually tied to him; and more. He would still, though, have been found through a search of a 338,000-item DNA database, looking for a DNA profile that is possessed by 1/1,100,000 of the population — and under the article’s assertion, “the probability that the database search had hit upon an innocent person” would still have been “1 in 3.”

Despite all the other evidence that the police would have found, and even if the prosecutors didn’t introduce the DNA evidence, there would be, under the article’s description, a 1/3 chance that the search had hit upon an innocent person (Pickett), and thus a 1/3 chance that Pickett was innocent, presumably more than enough for an acquittal. That can’t, of course, be right. But that just reflects the fact that 1/3 is not “the probability that the database search had hit upon an innocent person.” It’s the probability that a search would have come up with someone innocent if the rapist wasn’t in the database.

I think that’s exactly it. I believe the reason is that inclusion of a known guilty person in the database corrupts the math involved in pure probabilities of finding an innocent person.

I think Eugene has hit upon an actual error in the piece with this, and not just a matter that’s open to debate. I don’t think they would ever correct it, because they have a history of failing to correct errors if the explanation of the error is long and difficult — even if it’s unquestionably an error. Still, when I have more time, I’ll follow up on this more.
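One way to see the distinction concretely is to simulate many database searches and ask: among searches that return exactly one hit, how often is that hit an innocent person? Here is a rough Monte Carlo sketch in Python. The 50% prior that the true perpetrator is in the database is purely my illustrative assumption, not a fact from the case:

```python
import math
import random

random.seed(0)

RMP = 1 / 1_100_000    # random match probability
DB = 338_000           # innocent profiles searched per trial
PRIOR = 0.5            # assumed chance the true rapist is in the
                       # database (illustrative assumption only)
TRIALS = 200_000

def coincidental_hits():
    """Sample how many innocent profiles match by pure chance.
    Binomial(DB, RMP) is approximated by Poisson(DB * RMP)."""
    lam = DB * RMP
    u = random.random()
    k, term, cum = 0, math.exp(-lam), math.exp(-lam)
    while u > cum:
        k += 1
        term *= lam / k
        cum += term
    return k

single_and_innocent = 0
single_total = 0
for _ in range(TRIALS):
    perp_in = random.random() < PRIOR   # the guilty man always matches if present
    hits = coincidental_hits() + (1 if perp_in else 0)
    if hits == 1:
        single_total += 1
        if not perp_in:
            single_and_innocent += 1

frac = single_and_innocent / single_total
print(round(frac, 3))   # roughly 0.23 -- already below the article's 1/3
```

Even at a 50/50 prior, the lone hit is innocent less than a quarter of the time, and the figure drops fast as the prior rises. The 1-in-3 number is the chance that a search of innocents produces some hit; it is not the chance that a hit, once obtained, is innocent.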

Read Volokh’s entire post, which has other illuminating insights, here. Previous posts on this subject here, here, and here.

50 Responses to “Volokh on DNA and Cold Hits”

  1. That makes sense. The database is about one-third the size of the population you would have to search to expect one occurrence of that match. So if you are looking among 338,000 people for something that occurs in only one of 1,100,000 people, your chances of finding it will be about one out of three. But I would say “guilty or innocent”. Anyone.

    nk (1e7806)

  2. The Times’ math is still (almost) correct.

    As I mention here, if all you have is a “cold hit” in a DNA database, you have about one chance in four of having gotten a match in that database even if the real perpetrator is not listed.

    Eugene Volokh has added other factors:

    Let’s say that the prosecution comes up with a vast amount of other evidence against Pickett — he admitted the crime in a letter to a friend; items left at the murder site are eventually tied to him; and more.

    That’s different. If the corroborating evidence pans out, we’re no longer looking at just the DNA hit. We’re looking at the DNA hit, and the admission in a letter, and the other items connecting him to the crime, each of which has its own long odds against happening by chance.

    And when you write:

    I think that’s exactly it. I believe the reason is that inclusion of a known guilty person in the database corrupts the math involved in pure probabilities of finding an innocent person.

    No. A person may be in the database because he’s known to be guilty of something, but that doesn’t make him “known guilty” of that particular crime.

    No. The genetic match, at the level described in the article, may be probable cause to investigate the suspect and turn up the other evidence, but there’s still a 26% chance of matching someone in the database even if the actual rapist is not listed. That’s the mathematical fact.

    Karl Lembke (ff486c)

  3. Karl,

    We’re not communicating at all and I don’t have time to explain all the ways you’re misapprehending what I said. Here’s the quickest way to say it:

    “there’s still a 26% chance of matching someone in the database even if the actual rapist is not listed.”

    Replace “even if” with “only if” and you’ve got it.

    Patterico (ffe933)

  4. And if you’re saying the LAT math is right, you haven’t understood Eugene’s post.

    Patterico (ffe933)

  5. The problem with the LAzy Times reporting is that they are basing statistics upon a foundation of compounding syllogistic fallacies.

    By the LAzy Times reporting methodology, since we share 98% of our DNA with chimpanzees, the likelihood that Ted Bundy is innocent and Cheetah was actually out looking for women with straight hair parted down the middle is 1 in 3.

    And the likelihood that John Wayne Gacy was innocent, but that Bonzo had a thing for burying young boys in his attic, was 1 in 3.

    They are confusing the science with irrelevant statistics and then combining the two.

    Forensic evidence tends to come not in isolation, but in combination. For instance, if a Ferragamo footprint, size 10 1/2, is found at the scene, we can eliminate Cheetah, since she was known to wear Jimmy Choo’s. Bonzo, of course, is not out of the woods yet, according to the LAzy Times.

    Back here in the real world, we would tend to focus on humans with that shoe size. We might also find a number of hairs and fibers at the scene.

    The CODIS database for known violent offenders is simply a place to start when matching a crime scene involving violence.

    As we compound the evidence, each marker that matches both includes and excludes potential violent offenders. It doesn’t just do one or the other.

    Adding each successive marker as a match, eliminates cards from the deck, if you will…and narrows down the cards to choose from. However, combined with other evidence at the scene, certain cards are much less likely to be involved.

    The queen of hearts has a size 6 shoe, she has flaming red hair, she’s 4 ft 10 inches and weighs 90 lbs, she is not likely to be able to carry a 200 lb body… 400 yards through the woods and thicket.

    Common sense and additional forensic evidence are not left on the courthouse steps in a DNA forensics case. It is the totality of the evidence (motive, opportunity, methodology, etc.) that gives us the full picture.

    The DNA evidence gives us a SUBSTANTIALLY higher reliability of matching than eyewitness testimony alone would. In combination with other factors, it is the single most powerful tool in the prosecution AND DEFENSE of defendants.

    When you add five matching markers together, and you get a match on all five…PLUS you have evidence of the defendant being in the area on the date and time in question, PLUS you have a history of similar acts committed in a similar pattern….the chances of having the wrong person are infinitely slimmer than any other prior prosecutorial methodology.

    Not 1 in 3…not even 1 in 3 million.

    But, if you are a leftist…keep a watch out for Bonzo.

    cfbleachers (4040c7)

  6. The Times’ math is still (almost) correct.

    No, it’s not. There is a 1 in 3 chance of any hit in the database. There is a 1 in 1,100,000 chance of hitting any particular person.

    That’s not the same as saying there’s a 1 in 3 chance of hitting an innocent person.

    Steverino (d6232c)

  7. Steverino:

    The point is, the match with Puckett was a match with any person in the database. It was not a match with a specific person based on an a priori decision to check one individual’s DNA for a match.

    Here’s an example from another field.
    Statistical results, say in epidemiology, are considered significant if they have a p value less than 0.05. That means there is only about one chance in twenty of obtaining such a result by pure chance alone.

    So if you suspect a link between lime-flavored chewing gum and oral cancer, you set up an experiment to test this link, and you get a p value of, say, 0.04, or 1/25, then you have a statistically significant result, and you can publish your research.

    However, suppose you set up forty tests. You test lime-flavored chewing gum. But you also test cherry-flavored, lemon-flavored, and 37 other flavors. Based on pure chance, about two of your tests will come out significant at the .05 level or lower.

    So have you proved that people who chew lime-flavored (and turnip-flavored) gum are at a greater risk for cancer? Or have you demonstrated that a 20:1 bet will pay off about one time in twenty?
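The multiple-testing arithmetic in that example, as a quick sketch in Python:

```python
# 40 independent tests, each judged at a 5% significance threshold.
alpha = 0.05
tests = 40

expected_false = alpha * tests             # about 2 spurious "significant" results
p_at_least_one = 1 - (1 - alpha) ** tests  # about 0.87: almost certain to get
                                           # at least one by chance alone
print(expected_false, p_at_least_one)
```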

    Karl Lembke (1d7861)

  8. But that just reflects the fact that 1/3 is not “the probability that the database search had hit upon an innocent person.” It’s the probability that a search would have come up with someone innocent if the rapist wasn’t in the database.

    How is the rest of the database supposed to know if the rapist is in it or not? It would seem to me that if there is a 1 in 3 chance of someone getting a false hit if the rapist is not in the database, there would also be a 1 in 3 chance of someone else getting hit if he is. It shouldn’t be too hard to test that theory, since the predictable result would be that one out of three times that the rapist is in the database (often, I presume), cold hits would match two people rather than one.

    That said, there is obviously something wrong with the LA Times’s math if they think that the larger the database, the less reliable the one hit they got (among a gazillion non-hits they could have gotten but didn’t). Wouldn’t the most reliable database in the world be one that included the entire world’s population yet only matched to one individual?

    Xrlq (b71926)

  9. We’ve already had a case where someone was a SIX marker match, and the only match in the database, but couldn’t have done the crime. That the LA Times can’t do math should not be news. However, that does not mean that those criticizing them are correct.

    htom (412a17)

  10. As one commenter over at Volokh pointed out, a court decision on a somewhat similar case was just issued:

    http://www.ca9.uscourts.gov/ca9/newopinions.nsf/FEDAEC858794EE338825744000525961/$file/0715592.pdf?openelement

    jim2 (a9ab88)

  11. Karl-

    We all agree that there is a roughly one in three chance of a match. The difference is how we view the meaning of that match.

    The LA Times asserts that that is the odds of a match to an innocent person. But how does the LA Times know beforehand that the person identified is innocent?

    Patterico’s point is that we don’t yet know if the person is innocent or guilty, only that their DNA matched the suspect.

    I would say that the hit identifies a person of interest, not enough to make them a suspect (and certainly not enough to convict), but definitely worth checking out.

    MartyH (52fae7)

  12. The point is, the match with Puckett was a match with any person in the database. It was not a match with a specific person based on an a priori decision to check one individual’s DNA for a match.

    You don’t seem to understand how a database is searched. It consists of looking at all the records, and displaying which ones match. So, it’s a series of 338,000 individual tests.

    The chance of any one of them matching is 1/1,100,000. The chance of there being a match within the database is roughly 1/3. That still isn’t the same thing as saying there’s a 1/3 chance of a false positive, which is what The Times was saying.

    Steverino (d6232c)

  13. This is really interesting, because while the court case just issued correctly points out that the prosecution engaged in what’s known as the Prosecutor’s Fallacy, the LA Times does exactly the opposite, and engaged in the Defendant’s Fallacy. Both are incorrect. There’s a fairly good writeup on them in Wiki, if anyone’s interested.

    Basically, the problem in the court case just listed was that the prosecutor (combined with the expert witness) conflated the odds of a match with the odds of guilt. This is plainly wrong. And the court held that that, combined with a bunch of other facts, rose to the level of reversible error. Should it have? I don’t know. But clearly both the prosecutor and the expert witness should have known better, and after this decision soaks into their consciousness, they will.

    Moving on to the case at hand: what is the probability of an innocent person being identified? It’s the combination of two factors. Factor one is the odds of the guilty person not being in the database. Factor two is the odds of getting exactly one hit in the database. Note, this is not the same as just dividing things, because obviously there was some chance of getting 2 or more hits, and if that had happened we wouldn’t be talking about this.

    But for the sake of simplicity I’ll accept the 1 in 3 number. Now how do we get a number for the other factor? Honestly, I don’t know, but let’s say that 90% of all violent criminals have been caught, and are therefore in the database, so there’s a 90 percent chance that the guilty person is in there. That would be a 1 in 10 chance that the guilty person wasn’t in the database, and therefore a 1 in 30 chance that an innocent person was identified. If there’s only a 50% chance that the guilty person is in the database, then it would be 1 in 6.

    Now, I have no idea what that number actually should be. But you could probably get a ballpark estimate by using the percentage of violent offenders who are repeat offenders (and therefore would be in the database). Anyone happen to know that number?
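The two-factor arithmetic above can be sketched in a few lines of Python (the 90% and 50% figures are, as the comment says, illustrative guesses, not measured values):

```python
def p_innocent_identified(p_perp_in_db, p_coincidental_hit=1/3):
    """Rough chance that a cold hit names an innocent person: the guilty
    party must be absent from the database AND a coincidental match occur.
    The 1/3 default is the article's chance of a hit in an all-innocent
    database; the prior is an illustrative guess."""
    return (1 - p_perp_in_db) * p_coincidental_hit

print(p_innocent_identified(0.9))   # 90% chance perp is listed -> 1 in 30
print(p_innocent_identified(0.5))   # 50% chance perp is listed -> 1 in 6
```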

    Skip (ba6438)

  14. You bring up a good point, Skip, which is that the DNA database itself is not a random sampling of the total population’s DNA. It is heavily weighted toward criminals, and therefore represents a larger database of potential offenders than it first appears. This shifts the odds in favor of identifying the correct person in a Bayesian analysis.

    SPQR (26be8b)

  15. So, if half the people in the country were in the DB, then you’d get 200 hits but have only a 50/50 chance that one of them was the right person?

    quasimodo (edc74e)

  16. ignoring the bias of the data base toward known violent offenders

    quasimodo (edc74e)

  17. Sorry to seem dense, but I’m in the dark about a couple things.

    I’m assuming that STR analysis is being done on these samples. That gives us a total of 13 possible loci.

    I’ve never done DNA procedures, but it would seem that markers make up the “description” of the DNA sample, and a “hit” would constitute some percentage of similarities, or matches. I believe the minimum is 4 or 5 to constitute a hit, if I’ve read correctly.

    I would argue that there is an additional variable at play. (# of markers in the evidence sample out of a total possible)

    Which leads back to the question of what a “hit” constitutes, and whether all “hits” are the same or not. If they are not, then arguing over the percentages is useless, as it is an improper distillation of unrelated variables. A hit of 13 markers out of a possible 13 would be very different than a hit of 4 out of 13. Calling both events “hits” clouds the statistical issue.

    I’m not comfortable trying to label DNA “hits” as “accurate” or “inaccurate”. I keep thinking of the OJ juror who said something like “all that DNA stuff is a bunch of hooey”, which makes me wonder as to the real intent of the Times article.

    Apogee (366e8b)

  18. Apogee, go back to the original posts, the discussion arose with respect to an LAT article that discussed a DNA comparison involving fewer markers.

    SPQR (26be8b)

  19. The more I think about it, the worse the LA Times’ logic is.

    The odds of an “innocent match” in this case are not 1/3. They are 100%, because it actually occurred.

    It’d be the same thing as the LA Times reporting, today, that the Giants have a 1 in 3 chance of winning the 2008 Super Bowl because those were the odds before the game was played.

    This is an actual case, beyond the realm of probability and into reality. Statistics play no role in the evaluation of the case today.

    MartyH (52fae7)

  20. Thanks SPQR

    Apogee (366e8b)

  21. SPQR, it definitely does. But it weights it towards guilt from the 1 in 3 number, not from the 1 in 1.1 million number. Let’s say that there’s only a 1 in 100 chance that the guilty person is not in the database (and that seems extremely high to me). Then we’d be talking about a 1 in 300 chance of identifying someone who’s not guilty. Would that be enough for reasonable doubt, if I were on the jury? I don’t know, but if there were _no_ other evidence connecting him, it probably would.

    Skip (ba6438)

  22. 1/3 is not “the probability that the database search had hit upon an innocent person.” It’s the probability that a search would have come up with someone innocent if the rapist wasn’t in the database.

    No. Even if the rapist was in the database (as he was), there’s still a 1-in-3 chance that an innocent person would be fingered by the database. (If that happened, the DB would flag two people, the guilty and the innocent.)

    Having a certain DNA strand that’s 1-in-a-million to have 5.5 IDs doesn’t make anyone else on the face of the earth less likely to have DNA with identical 5.5 IDs.

    —-

    In fact, as a matter of probability, we are more likely to share DNA than it first appears.

    Postulate 1M people. Each of them has a certain ID#, ranging from 0 to 999,999. However, there are some duplicates, and some numbers that are unused (that would happen if you randomly assigned numbers).

    The task: iterate through the group, 1 by 1. Each time you get to a person, you ask “how many people have this identical #”? And then you add that number to a running total.

    If everyone had a unique number, the sum total at the end of your calculation would be 1M exactly.

    However, some people share #s. Each time you hit on a number that is shared by 2 people, instead of adding 1 to the sum, you would add 2, and this would happen a second time, when you got to the other person in the pair.

    If everyone had unique ID#s, every time you counted X people, you would only add X. But where X people share the same ID, instead of adding X for all of them, you would add X^2.

    So your sum at the end of the calculation will be well over 1M.

    The average number of PEOPLE with an ID#, given any ID number, will always be 1, no matter how IDs are distributed.

    But given a PERSON, the average number of other people with same ID will not necessarily be 1. If you divide the sum by 1M, that gives you the average #, given any PERSON, of people who share that ID (including the original person).

    —-

    Let’s say there’s 100 customers in a supermarket with 10 checkout lanes. I randomly assign people to each lane.

    People are MORE LIKELY to end up in a lane with more than 10 people than they are to end up in a lane with less than 10 people. It’s not a 50/50 shot as to whether you’ll be in a short lane or a long lane.

    If these are the # of people in each lane after I distribute randomly:
    8 10 9 10 11 10 12 10 10 10

    Then there are 17 people in the short lanes and 23 people in the long lanes. Your chance of ending up in a long lane is higher than ending up in a short lane, even though the initial distribution of lanes was fair.
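Using the comment's own lane counts, the asymmetry is a two-line check in Python:

```python
# Ten lanes after randomly distributing 100 customers
# (the counts given in the comment above).
lanes = [8, 10, 9, 10, 11, 10, 12, 10, 10, 10]

in_long = sum(n for n in lanes if n > 10)    # people in above-average lanes
in_short = sum(n for n in lanes if n < 10)   # people in below-average lanes
print(in_long, in_short)  # 23 vs 17: a random customer is likelier
                          # to be standing in a long lane
```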

    —-

    I think it’s largely the same way with DNA. To go back to my prior example, when investigators take DNA from a crime scene, they are taking a person’s ID#. They are not randomly generating an ID# from scratch.

    Daryl Herbert (4ecd4c)

  23. Daryl, taking DNA is not really “taking a person’s ID#”. First of all, we have things like twins. But ignoring for the moment who might have actually identical DNA, we don’t read someone’s complete DNA with current technology. To oversimplify a bit, we break off pieces of it and indirectly compare those pieces with pieces broken off of another sample.

    SPQR (26be8b)

  24. Daryl – Having read a little more, I would tentatively say that in your example, each person’s number would be made up of 13 letters:
    Each letter is a variable whose numerical value would be drawn from 10 different possible numbers.

    Following SPQR’s example in #23, analysis would consist of finding those who had at least five matching values at the corresponding positions in the number.

    I think that’s it, although I’m not sure.

    Apogee (366e8b)

  25. There are 1.1M variations on the 5 markers in the evidence sample. There are 338K samples in the database. There is, therefore, about a 1/3 chance of finding a match in the database, regardless of any other circumstances.

    For that reason, finding a match in the database is not of sufficient probative value to convict.

    Take it the opposite way. Suppose we had 338K samples of unknown persons, all of which were found in circumstances that would constitute a perfect alibi. If the suspect’s DNA matched one of those samples, would that be conclusively exonerating evidence? No, because there is a 1/3 chance of a match by coincidence.

    The prosecution’s use of the 1.1M to one chance of a match is deceptive, because the search represents 338K trials – nearly 1/3 of the possible combinations.

    The odds of any one lottery ticket winning a $10,000 prize are 1 in 20,000. (Lotteries pay out 50%.) The odds of one of 10,000 tickets winning the prize are 1 in 2.

    IOW, the rare outcome ceases to be rare with enough trials.

    It is not devoid of probative value, but it is the equivalent of a decent eyewitness description.

    Rich Rostrom (7c21fc)

  26. I believe Patterico is correct that the LA Times has made an error. Not so much in the math but in what the math means. However, I believe the LA Times is correct that the jury should have been told that even if the guilty person was not in the database there was a good chance of getting a match.

    Perhaps the corroborative evidence gathered after the match is still sufficient to prove guilt beyond a reasonable doubt, but the DNA match by itself is not, and this should have been made clear to the jury. Just citing the 1 in a million number is very misleading.

    Btw are comments closed for the Levine post intentionally?

    James B. Shearer (fc887e)

  27. There are 1.1M variations on the 5 markers in the evidence sample. There are 338K samples in the database. There is, therefore, about a 1/3 chance of finding a match in the database, regardless of any other circumstances.

    For that reason, finding a match in the database is not of sufficient probative value to convict.

    Your logic is in error here. Let me give you this example:

    Suppose we had a database with the DNA of every human being who has ever existed, all 60 billion or so. We have a DNA sample of a criminal, and we run it through the database. Since we’ve got every person who has ever existed, we are guaranteed a hit: a 100% chance. Are you going to argue that finding a match is not of sufficient probative value to convict?

    By your logic, if the database were smaller — say only 3380 samples — the chance of finding a hit in the database would be 1 in 300, so that would mean such a hit would be more likely to be the guilty party.

    Regardless of the number of samples in the database, the odds of hitting any one in particular are still 1 in 1,100,000. So, if there is a match in the database, the chance that that person is innocent based purely on DNA is 1 in 1,100,000. If there are other factors — say, the person was in Borneo at the time of the crime — those would diminish or even render null the probability of guilt.

    Steverino (d6232c)

  28. MartyH writes:

    We all agree that there is a roughly one in three chance of a match. The difference is how we view the meaning of that match.

    I’m not sure “we all agree” with that number.
    In his original post, Patterico writes:

    It seems to me that the conclusion does not logically follow at all. The formulation simply can’t be right.

    Not “the inference” — “the formulation”.
    That’s the math.

    You continue:

    The LA Times asserts that that is the odds of a match to an innocent person. But how does the LA Times know beforehand that the person identified is innocent?

    Well, the obvious answer is that he is presumed innocent until proven guilty. The question being batted around is whether the genetic match, by itself, is sufficient evidence to prove guilt, and if so, to what degree.

    At least, that’s how I read it.

    Karl Lembke (71d63b)

  29. Steverino writes:

    Suppose we had a database with the DNA of every human being who has ever existed, all 60 billion or so. We have a DNA sample of a criminal, and we run it through the database. Since we’ve got every person who has ever existed, we are guaranteed a hit: a 100% chance. Are you going to argue that finding a match is not of sufficient probative value to convict?

    No, but that’s miles away from the case discussed in the LATimes article.

    Regardless of the number of samples in the database, the odds of hitting any one in particular are still 1 in 1,100,000. So, if there is a match in the database, the chance that that person is innocent based purely on DNA is 1 in 1,100,000. If there are other factors — say, the person was in Borneo at the time of the crime — those would diminish or even render null the probability of guilt.

    It seems to me, by your logic, a match is still a match, and if there’s a match, the odds that the person is innocent is one in 1,100,000, even if he was in Borneo at the time. You would have to conclude, based on the match, that he somehow managed to be in two places at once.

    That’s obviously absurd.

    Let’s go back to the case discussed in the Times.

    You have a person who was one member out of 338,000 in a database. A police officer decides to “round up the usual suspects” and submit a DNA sample to see if he finds a match.

    He finds one.

    Now, given that we have a DNA match, what we’re most interested in is the probability that a person is the perpetrator, given that his DNA matched the sample left at the crime scene.

    The answer turns out to be one divided by the number of people in the world who would, theoretically, match the same sample. Since the probability of a match is 1 in 1,100,000, in a world population of 6.7 billion people, there are some 6,090 people who can be expected to match that sample.

    So, using the math of conditional probability, the probability that Mr. Puckett is guilty, based solely on DNA evidence, is 1/6090.
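That arithmetic in miniature, as a sketch in Python (the 6.7 billion world-population figure and the uniform prior over everyone who matches are the comment's assumptions, and the latter is exactly the premise under debate):

```python
# Reproducing the comment's conditional-probability arithmetic.
rmp = 1 / 1_100_000      # probability a random person matches the sample
world_pop = 6.7e9        # comment's assumed world population

expected_matches = world_pop * rmp    # about 6,090 matching people worldwide
p_guilty_given_match = 1 / expected_matches  # under a uniform prior over them
print(expected_matches, p_guilty_given_match)
```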

    I have a discussion of this, with diagrams, at this post at my own blog.

    Karl Lembke (c0c7cf)

  30. The question being batted around is whether the genetic match, by itself, is sufficient evidence to prove guilt, and if so, to what degree.

    Even a 13 marker match is not enough to prove guilt. Proving presence at a crime scene does not prove the commission of the crime. I don’t think anyone’s arguing for genetic matches to replace all investigative work. As Patterico argued, the use of terms “innocent” and “guilty” clouds the issue and makes people fearful and then question the Forensic DNA process.

    Apogee (366e8b)

  31. “there’s still a 26% chance of matching someone in the database even if the actual rapist is not listed”

    OK, after thinking about it more, Karl, I think this statement of yours is accurate — it’s just not what the LAT is saying.

    Patterico (4bda0b)

    I believe Patterico is correct that the LA Times has made an error. Not so much in the math but in what the math means.

    I think that’s exactly right.

    However I believe the LA Times is correct that the jury should have been told that even if the guilty person was not in the database there was a good chance of getting a match.

    Well, that’s more a legal question than a statistical one. But your argument for telling the jury that is much stronger if you assume that the jury was told about the database hit — which it wasn’t. Since it wasn’t, there’s not necessarily a need to correct the “what are the chances?!” style misimpression that would have been caused if the jury had been told that.

    Granted, the jury might still wonder why this guy is sitting in front of them — but they are generally told to ignore the fact of a defendant’s arrest in determining guilt or innocence.

    Patterico (4bda0b)

  33. No. Even if the rapist was in the database (as he was), there’s still a 1-in-3 chance that an innocent person would be fingered by the database.

    But we’re dealing with the scenario where we got one hit. If we assume the rapist is in the database, then there is 1) a certainty of a true hit, and 2) a 1/3 chance of a false positive among other people. Using that assumption (we know a guilty person is there), the fact that we got only one hit tells us there was no false positive.

    Now, if we instead assume that we *don’t know* whether the rapist is in the database, we can’t talk in terms of the probability of a “false positive” — just the probability of any match at all. Once we get that match, we won’t know whether it’s false or not — at least not solely based on statistical analysis. I don’t even know how we could say, statistically speaking, what the chances are that the match is false.

    We can, however, tell the jury that assuming a database of innocent people there still would have been a 1/3 chance of a false positive. (I wouldn’t tell the jury this unless the jury heard evidence of a database hit, which they generally don’t.) That would be a correct statement of the statistics.

    But that’s NOT the same as saying: there’s a 1/3 chance that the guy sitting in front of you is innocent.

    THAT’S what the LAT did, and that’s what was so wrong about their article.

    Patterico (4bda0b)

  34. #30 Apogee:

    Which leads back to the question of what a “hit” constitutes, and whether all “hits” are the same or not.

    What’s actually tested are DNA fragments of varying lengths, separated through a sieving gel, resulting in discrete numbers representing those lengths. When all the loci of interest are present in the sample, what results is a set of variable arrays, the length of the arrays dependent on how many different lengths of fragments show up in the test; the elements of the arrays are those lengths.

    If you have only 5 loci present in the sample, you still have a very precise identification, but only to certain limits. I.e., if the perpetrator of a crime left a sample at the scene with an array of 3 fragment lengths at locus name (6, 10, and 12), but the guy we are investigating, Joe Butthead, has an array of only 2 elements at locus name, he is immediately excluded from having left the sample. Cindy Lou Reprehensible is also excluded: at name she has 3 elements in her array, but they are 6, 8 and 12.

    The problem with not having a complete set of loci to work with is that you can only exclude x number of people that way because the remainder match at all the loci available.
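The exclusion logic described above can be sketched as a comparison of fragment-length arrays per locus. The locus name and all fragment lengths here are the made-up examples from the comment, not real CODIS data:

```python
# Sketch of exclusion by fragment-length comparison, per the description above.
# Locus names and lengths are illustrative only.
def excluded(evidence: dict, suspect: dict) -> bool:
    """A suspect is excluded if, at any locus present in the evidence
    sample, their fragment-length array differs from the sample's."""
    return any(
        sorted(suspect.get(locus, [])) != sorted(lengths)
        for locus, lengths in evidence.items()
    )

evidence = {"locus_name": [6, 10, 12]}
joe = {"locus_name": [6, 10]}         # only 2 elements: excluded
cindy = {"locus_name": [6, 8, 12]}    # 3 elements, wrong lengths: excluded
other = {"locus_name": [6, 10, 12]}   # matches at every tested locus: not excluded

print(excluded(evidence, joe))    # True
print(excluded(evidence, cindy))  # True
print(excluded(evidence, other))  # False
```

As the comment notes, with fewer loci you can only exclude so many people; everyone whose arrays match at all the available loci remains in the pool.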

    EW1(SG) (84e813)

  35. The LAT’s math error:

    If Mr. Puckett was innocent, like everyone else in the DB, then there is a 1/3 chance that Mr. Puckett is innocent.

    Daryl Herbert (4ecd4c)

  36. 32

    “Well, that’s more a legal question than a statistical one. But your argument for telling the jury that is much stronger if you assume that the jury was told about the database hit — which it wasn’t. Since it wasn’t, there’s not necessarily a need to correct the “what are the chances?!” style misimpression that would have been caused if the jury had been told that.”

    If the defense was not allowed to tell the jury that the DNA match was the result of a search through a large DNA database, I consider that completely unfair and think the conviction should be overturned.

    James B. Shearer (fc887e)

  37. EW1(SG) – That’s not my #30, it’s actually my #17, when I first read the post and before I read the earlier posts. Patterico’s #32 & #33 pretty much sum it up (pardon the math pun) for me. The Times article seems to imply that the guy has a 1/3 chance of being innocent, which to jurors conjures up the vision of the police seeing any group of 3 people and grabbing one of them randomly, which is an inaccurate picture of the process.

    Thanks for the explanation, though. It would seem that this is not simple or straightforward at all (as expressed by the court in the 9th Circuit ruling). Seeing the input of all the parties commenting, most of whom I consider intelligent and thoughtful, is eye-opening given the divergence of opinions on the subject and its description.

    Apogee (366e8b)

    James B. Shearer – if the defense was not allowed to tell the jury that the DNA match was the result of a search through a large DNA database I consider that completely unfair and think the conviction should be overturned.

    I don’t agree – although I’ll readily say that I’m uncomfortable with jurors being told that the possibility of the hit was 1:1,100,000. It may be accurate to a degree, but even comments here make the mistake of applying the words guilt and innocence to the match. A DNA hit does not prove guilt, plain and simple. Jurors need to know what it does and does not prove.

    Apogee (366e8b)

  39. My innumeracy on previous threads has brought me great shame. Let me try to fix all that:

    If we start from scratch, without knowing how many DB hits there will be, there is about a 1/3* chance of getting an innocent hit, whether or not the guilty person is in the DB.

    BUT knowing that exactly 1 hit is returned, we have to rule out some things that were previously possible.

    BEFORE we know how many hits there will be, there could be:

    CH_IF_OUT == chance of that result, IF the offender is out of the DB

    CH_IF_IN == chance of that result, IF the offender is in the DB

    * as was pointed out on the first thread, it’s not 1/3. I have used the correct probabilities in the chart below.

    POSSIBLE RESULT ___ CH_IF_OUT ___ CH_IF_IN
    0 hits ___ 73.54% ___ 0.00%
    1 hit ___ 22.60% ___ 73.54%
    2 hits ___ 3.47% ___ 22.60%
    3 hits ___ .36% ___ 3.47%
    4+ hits ___ .03% ___ .39%

    That’s what we would expect before we know how many hits the DB will return (and thus obviously, before we know whether our suspect is in the DB or not)

    If we know there is exactly 1 hit, we can rule out every possibility not involving exactly one hit.

    Assuming it’s 50/50 that the suspect is in the DB (why would we assume that????), then any time we see only a single hit from the DB and know nothing else (assume it is a DB drawn from the general pop, and not a DB of sex offenders), there is a 1/4* chance that the subject is innocent.

    * actually 23.50%

    But we can’t assume a 50/50 chance that the suspect is in the DB or not. That calculation is inherently illegitimate. The 23.50% figure is meaningless.

    We have no way of knowing how often the DB really returns the guilty person rather than an innocent person, because the DB is based on sex offenders (not general pop.) and the facts of each case (the source of each DNA sample put into the DB) are different.

    For frat boy date rape cases, the chance a DNA sample will be in the DB may be very low (let’s say 1%, a made-up number). If it’s really that low (most of the guys in the sex offender registry are probably old dudes and/or in prison, and not terrorizing sorority girls at frat parties), then finding a match from the DB has low meaning: there would only be a 3.2% chance that the DB hit was actually guilty of the crime. That’s worth looking into, but it’s not much.

    For a long-ago stranger rape + murder, like the instant case, the chance a DNA sample will be in the DB may be much higher (say, 70%, again, a made-up number). It would be higher because, if it was committed by a serial rapist, he might have been caught for other crimes and be in the DB. Maybe we really have DNA on 70% of the serial rapists in CA who attack adult women.

    If we knew there was a 70% chance, in long-ago stranger-rape-murder cases, that the offender’s DNA would be in the DB, then upon finding out that there was a single “hit” from the DB, the chance that we’ve got the correct suspect is actually 88%! In other words, there is only a 1 in 9 chance that the police got the wrong guy.

    At least, I think that’s how Bayesian math works:

    73.54 x 70 / (73.54 x 70 + 22.60 x 30) == .8836
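The update in the comment above is indeed standard Bayes; a quick sketch using its conditional probabilities (and its admittedly made-up 70% and 1% priors) reproduces both the 88% figure here and the 3.2% frat-party figure:

```python
# Bayesian posterior: P(the single DB hit is the offender | exactly one hit),
# using the comment's conditional probabilities. Priors are the commenter's
# made-up numbers, used purely for illustration.
p_one_hit_if_in = 0.7354   # P(exactly 1 hit | offender in DB)
p_one_hit_if_out = 0.2260  # P(exactly 1 hit | offender not in DB)

def posterior(prior_in: float) -> float:
    """If the offender is in the DB and there is exactly one hit,
    that hit must be the offender."""
    num = p_one_hit_if_in * prior_in
    return num / (num + p_one_hit_if_out * (1 - prior_in))

print(f"{posterior(0.70):.4f}")  # ~0.8836 (stranger rape-murder scenario)
print(f"{posterior(0.01):.4f}")  # ~0.0318 (frat-party scenario, the ~3.2%)
```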

    Daryl Herbert (4ecd4c)

  40. 38

    “I don’t agree”

    Why not? It seems to me that basically the defense should be allowed to present any evidence it wants unless there is a good reason to exclude the evidence. What is the good reason for keeping this secret from the jury?

    James B. Shearer (fc887e)

  41. Sorry, my table appeared to work in the preview.

    If suspect IS in the DB, here is the chance of getting certain results:

    0 hits ___ 0.00%
    1 hit ___ 73.54%
    2 hits ___ 22.60%
    3 hits ___ 3.47%
    4+ hits ___ .39%

    If suspect is NOT in the DB, here is the chance of getting certain results:

    0 hits ___ 73.54%
    1 hit ___ 22.60%
    2 hits ___ 3.47%
    3 hits ___ .36%
    4+ hits ___ .03%

    Daryl Herbert (4ecd4c)

  42. Patterico:

    OK, after thinking about it more, Karl, I think this statement of yours is accurate — it’s just not what the LAT is saying.

    You’re probably right with that. Explaining is not easy. You have to know the subject you’re trying to explain, and you have to know how to put it across to people in terms they can relate to.

    Generally, two rare skills.

    Karl Lembke (0bdd2a)

  43. @Karl

    That’s what I was trying to say the other day. Thanks for clearing it up.

    chad (582404)

  44. Apogee:

    The Times article seems to imply that the guy has a 1/3 chance of being innocent, which to jurors conjures up the vision of the police seeing any group of 3 people and grabbing one of them randomly, which is an inaccurate picture of the process.

    True.

    The only thing I can say in the Times’ defense is that the prosecution also presented an inaccurate picture. They pounded on the “one in 1.1 million” odds, while ignoring the math which shows that the actual odds are nowhere near that high.

    The probability of the genetic match in this case has to be considered in light of the statistical universe involved. If our universe is the total population of the globe, we can expect about 6,090 matches to the DNA sample. If, on the other hand, our statistical universe is the population of people who were in the same county as the victim at the time of the murder, and whose MO matches (or “is consistent with”) the details of the murder, we’re looking at a much smaller number of possible matches.
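The "statistical universe" point can be made concrete. The world-population figure below (~6.7 billion, roughly correct for 2008) is an assumption chosen because it reproduces the ~6,090 matches the comment cites; the county-sized pool is purely hypothetical:

```python
# Expected coincidental matches grow linearly with the statistical universe.
rmp = 1 / 1_100_000           # random match probability from the case

world = 6_700_000_000         # rough 2008 world population (assumption)
county = 10_000_000           # hypothetical county-sized pool

print(f"world:  ~{world * rmp:,.0f} expected matches")   # ~6,091
print(f"county: ~{county * rmp:.1f} expected matches")   # ~9.1
```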

    I think ideally, the DNA match should not have been introduced as evidence, simply because, by itself, it’s too tenuous a link. I suspect the prosecutor probably couldn’t resist the chance to throw a big number before the jury, as a last flourish.

    But this flourish has the potential to turn around and bite him, and other prosecutors.

    If the LATimes has its way, everyone on the planet will read this article. (No, this is not conspiracy-mongering. The LATimes wants everyone on the planet to read all its articles. And especially, their ads.) Any jurors who have read this article will have doubts about the meaning of those big numbers. Any prosecutor had better be able to justify those numbers, and show why they either aren’t being diluted by having taken millions of shots in a database, or are robust enough that even if you divide it into the population of the planet, you’re still looking at a thousands-to-one longshot.

    Karl Lembke (5418e6)

  45. #37 Apogee: Nope, not you at #30. Started to say something else and thought better of it.

    EW1(SG) (84e813)

  46. Daryl #41,

    That makes absolutely no sense.

    nk (1e7806)

  47. nk, here’s how I got the numbers. It’s based on the following calculation (which a commenter put on Patterico’s first thread)

    If there’s a 1-in-1.1M chance that any given person will have those 5.5 markers, that does NOT mean that there is a 1/3 chance that someone in a DB of 338k people will match.

    What it really means is, for each person, there is a:

    99.99990909% chance that they will NOT match.

    Let’s call that probability N.

    If you have a group of 10 people, the probability that none of them will match is:

    N ^ 10

    If you have a group of 338,000 people, the probability that none will match is:

    N ^ 338,000 == 73.54%

    The chance that there will be ONE match among them is:

    N ^ (338,000 – 1) * (1 – N) * 338,000

    The chance that there will be exactly X matches among them is:

    N ^ (338,000 – X) * (1 – N) ^ X * nCr(338,000, X)

    where nCr means “n choose r”, or “how many different ways can we choose X people out of 338,000”

    So to know the chance that there will be exactly 2 hits, if the perp is not in the DB, you calculate:

    N^(338,000 – 2) * (1 – N)^2 * 338,000 * 337,999 / 2

    which is equal to 3.47%

    If the suspect is not in the DB, there is still about a 3.5% chance of having 2 innocent people in that DB who have those 5.5 markers. Even though the chance of any individual, by chance, having those markers is 1/1.1M, it’s possible there would be two such persons in the database.

    You don’t need a database of size 2.2M people before it’s possible to have 2 hits in the DB. It’s possible, it’s just not very likely (only about 3.5%)
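The formula above can be written out directly; a short sketch using Python's math.comb for nCr reproduces the percentages quoted in this comment (and the corrected tables in #41):

```python
from math import comb

# Exact binomial probability of exactly x coincidental hits in a database
# of n innocent profiles, per the nCr formula in the comment above.
def p_exact_hits(x: int, n: int = 338_000, p: float = 1 / 1_100_000) -> float:
    q = 1 - p   # the "N" above: chance that one person does NOT match
    return q ** (n - x) * p ** x * comb(n, x)

for x in range(4):
    print(f"{x} hits: {p_exact_hits(x) * 100:.2f}%")
# 0 hits: 73.54%, 1 hit: 22.60%, 2 hits: 3.47%, 3 hits: 0.36%
```

The same formula with n = 6 rolls and p = 1/6 gives the dice percentages in the example that follows.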

    Let’s use a simpler example to illustrate the point. Instead of people, let’s use dice. You have 6 dice rolls in a database. The value of each roll could be between 1 and 6, chosen at random.

    How many 6s are in the database?

    You could say: there will always be exactly 1 six in the database.

    (1/6 chance of a 6, times 6 dice == 1)

    But that would be wrong. The dice are thrown at random.

    Sometimes you would have 0 sixes within 6 consecutive rolls (somewhat common; 33.5% of the time). Sometimes, all 6 would be 6s (about once every 47,000 times you rolled 6 dice).

    The thing is, it’s not always going to be exactly 1 six, just because there are 6 rolls total.

    Here are the actual probabilities, in %, based on the formula I gave above:
    0 – 33.5
    1 – 40.2
    2 – 20.1
    3 – 5.4
    4 – .80
    5 – .06
    6 – .002

    That’s how many 6s you can expect every time you roll 6 dice.

    —-

    In the real-world example, when you examine a person’s DNA for the first time, it is like rolling a die with 1-in-1.1M odds against a match. But that doesn’t mean you need a DB of size 2.2M in order to have 2 matches. You can have 2 innocent matches in a dinky little 338k DB. That will happen more than 3% of the time.

    Daryl Herbert (4ecd4c)

  48. Karl #44The only thing I can say in the Times’ defense is that the prosecution also presented an inaccurate picture. They pounded on the “one in 1.1 million” odds, while ignoring the math which shows that the actual odds are nowhere near that high. – agreed – see my #38.

    James B. Shearer #40 (about my 38 – wow, lots of numbers today)
    I agree that the defense should be able to present whatever it wants. I just have a problem with the notion of reversal due to DB trolling being a “secret”. As Lembke has stated in his #44, both the prosecution and the Times were incorrect in their presentation of the material. That is the key, and the reason that I disagree with your statement. I find it unacceptable for the defense to put forth inaccurate information, even if it will result in a win in that particular case. Simply trolling a database was not the problem. The problem has arisen from the improper presentation of the data obtained from that trolling to laymen (a set to which I belong). It seems that you would like to label a type of data gathering as improper, rather than accurately communicating what data is gathered from that method, along with its significance.

    Improper presentation does nobody any good, as it can be used in either direction, as we’ve seen here with the 1:1,100,000 vs. 1/3 arguments. Bad presentation can wrongly convict innocent defendants in the future, and that’s what both sides are trying to prevent.

    Apogee (366e8b)

  49. James B. Shearer – thanks for calling me out, though. “I don’t agree” is a bit vague, to say the least.

    Apogee (366e8b)

  50. 48

    Well, I was under the impression that the defense was not allowed to tell the jury anything about how the defendant came under suspicion. I believe the facts that it was through a search of a large database, and that given the accuracy of the test there was a significant chance of initially suspecting an innocent person, are important and should have been available to the jury. Of course you can argue about exactly what the defense should have been allowed to tell the jury, but I think they should have been allowed to tell them something.

    James B. Shearer (fc887e)

