Patterico's Pontifications

6/18/2008

L.A. Times Now Properly Describing Statistics in Article About DNA, Cold Hits, and Databases

Filed under: Dog Trainer,General — Patterico @ 7:41 pm

The L.A. Times never did correct that misleading statement of theirs in an article about DNA, databases, and cold hits.

But we’ve achieved a partial victory: they are now describing the statistics properly.

Remember the original passage that so distressed me:

Jurors were not told, however, the statistic that leading scientists consider the most significant: the probability that the database search had hit upon an innocent person.

In Puckett’s case, it was 1 in 3.

As I said in this post:

I believe the article meant to say this: if the database had consisted only of innocent people, there was a 1 in 3 chance that the search would hit on an innocent person. Phrased that way, the statement would be accurate, and would shed light on the question of how surprised we should be by a database hit.

But that’s not what the paper said. Instead, the article indicated the odds that the search “had hit” on an innocent person — in other words, the odds that Puckett himself was innocent.

The italics are in the original.

Note that I emphasized two problems with the passage:

1) The article said “had hit” when it should have said “would hit.”

The actual search resulted in only one hit: to the defendant. The article said there was a 1 in 3 chance that the search “had hit” on an innocent person. That was the same as saying there was a 1 in 3 chance that the defendant was innocent — which is wrong.

2) The article omitted a key assumption: that the database consisted entirely of innocent people.

By leaving out this assumption, the article further created the impression that the odds referred to the odds that the particular defendant on trial was innocent.

Yesterday, the L.A. Times reported on a related California Supreme Court decision that refused to require the database adjustment touted by the original L.A. Times article. This time around, reporters Maura Dolan and Jason Felch do a much better job explaining the 1 in 3 statistic:

The Times reported in May that a San Francisco judge presiding over a murder trial allowed jurors to hear the rarity statistic of 1 in 1.1 million. But the judge refused to permit the defense to reveal that there would be a 1 in 3 chance of finding a DNA match in the database, even if the actual perpetrator was not among the profiles.

The two errors I complained about before are now gone.

Proper use of forward-looking “would be” verb tense? Check.

Inclusion of key assumption that the perpetrator was not in the database? Check.

No, the paper never issued a correction. But I’m going to count this as a victory.

Not for me. For accuracy.


36 Responses to “L.A. Times Now Properly Describing Statistics in Article About DNA, Cold Hits, and Databases”

No, the paper never issued a correction. But I’m going to count this as a victory…For accuracy.

“You’re right.” 🙂 Congrats Patterico, and to others who worked to get this corrected.
no one you know (1ebbb1) — 6/18/2008 @ 8:06 pm
The paper probably holds that the two expressions are equivalent/identical.
.
Not that they honestly believe that, but they are prepared to argue it.
cboldt (3d73dd) — 6/18/2008 @ 8:20 pm
There is to my mind a glaring inaccuracy in a statement by the CA Supreme Court, however. From the LAT article cited:

The state Supreme Court said the prosecution properly told the jury that “it was virtually impossible that anyone other than defendant could have left the evidence at the crime scene.”

Although I know of no case where it has happened this way, I would point out that if one has a small DNA sample, one can create a quantity of identical DNA with relatively little expense. After all, that’s what labs do to test samples.

Therefore splashing a quantity of some random person’s DNA around a crime scene is not “virtually impossible”. It may be unlikely now, but probably only because some miscreant hasn’t tried it yet. It’s not as though someone couldn’t do it. It certainly isn’t “virtually impossible”.

That said, congratulations on persuading the LAT to get their facts straight on the issue.
Occasional Reader (16e7ce) — 6/18/2008 @ 8:23 pm
if the database had consisted only of innocent people, there was a 1 in 3 chance that the search would hit on an innocent person.

What? No. There would be a 100% chance that it would hit on an innocent person.
j curtis (c84b9e) — 6/18/2008 @ 8:30 pm
Great catch Patterico.

My beef with LA Times is not so much with the liberal slant as with the inaccurate reporting. People already have their own opinions — and want to pick up the paper to just read lots of stuff that is true.

LA Times journalists have very little pride in their craft and they see their role as helping the Democrats instead of reporting true stuff. California Conservatives know that liberals need zero help from the LA Times.
Wesson (785f2a) — 6/18/2008 @ 8:32 pm
— People … want to pick up the paper to just read lots of stuff that is true —

.

Yep. That’s accurate. But in the immortal words of Mick Jagger, “You can’t always get what you wa-ant.”

.

— There would be a 100% chance that it would hit on an innocent person. —

.

“Innocence is forever” Just kidding – I know what you mean. Would hit a person with no priors.
cboldt (3d73dd) — 6/18/2008 @ 8:46 pm
Michelle Malkin billed out 6 figures. What shall your consultancy bill at?
Ed (a9dfde) — 6/18/2008 @ 10:19 pm
2) The article omitted a key assumption: that the database consisted entirely of innocent people.

Why are you still clinging to that canard? It’s not a key assumption of the 1 in 3 random match probability described in the original article. It’s only a key assumption for the 1 in 3 any match probability described in the new one.

The odds that someone will randomly match are 1 in 3, whether the killer is in the database or not. If the killer is not in the database, you have 338,000 records unrelated to the killer, each of which stands a 1 in 1.1 million chance of randomly matching to him. If the killer is in the database, you have “only” 337,999 such records, each of which still stands a 1 in 1.1 million chance of randomly matching to him. The impact of removing that 1 record from the mix on the aggregate 1 in 3 figure is negligible.
Xrlq (62cad4) — 6/19/2008 @ 3:46 am
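The arithmetic in the comment above can be checked directly. The sketch below is only an illustration: it assumes every database record is an independent trial with a 1 in 1.1 million chance of a coincidental match, and the function names are made up for this example. Under that assumption the "1 in 3" figure is close to the expected number of coincidental matches (n times p), while the chance of at least one coincidental match comes out somewhat lower. Either way, dropping a single record barely moves the result.

```python
import math

P_MATCH = 1 / 1_100_000   # random match probability quoted in the thread
DB_SIZE = 338_000         # database size quoted in the thread


def expected_matches(n, p=P_MATCH):
    """Expected number of coincidental matches among n unrelated records."""
    return n * p


def prob_at_least_one(n, p=P_MATCH):
    """Chance of one or more coincidental matches (assumes independent records)."""
    # 1 - (1 - p)**n, computed with log1p/expm1 for numerical stability
    return -math.expm1(n * math.log1p(-p))


for n in (DB_SIZE, DB_SIZE - 1):
    print(f"{n} records: expected matches {expected_matches(n):.6f}, "
          f"P(at least one) {prob_at_least_one(n):.6f}")
```

The two rows differ by less than one part in a million, which is the "negligible" effect of removing one record.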
It’s a key assumption for calculating the odds of a hit to an INNOCENT person.

“The odds that someone will randomly match are 1 in 3, whether the killer is in the database or not.”

Indeed.

Funny, when I said exactly that in a letter to the LAT, you jumped down my throat. Glad to see you’ve come around.
Patterico (99b13f) — 6/19/2008 @ 8:19 am
j curtis,

Before the search you don’t know whether ANYONE is in the database. Hence the need for the forward-looking “would” verb tense.
Patterico (21c198) — 6/19/2008 @ 8:21 am
Everyone said you were right Patterico. And now it’s been proven. Bravo!
love2008 (1b037c) — 6/19/2008 @ 8:41 am
It’s a key assumption for calculating the odds of a hit to an INNOCENT person.

No, it’s not. The odds of a random match to an innocent person are 1 in 3. Period. It makes no (significant) difference whether you run the test on a database consisting of 338,000 innocents or one consisting of “only” 337,999 innocents, plus one guilty guy. Either way, the odds that one of the 337,999+ innocents will randomly match are 1 in 3.

“The odds that someone will randomly match are 1 in 3, whether the killer is in the database or not.”

Indeed.

Indeed, indeed. That’s why I find it so puzzling that you are continuing to suggest otherwise, effectively confusing the known random match probability factor (1 in 3) with the unknown, non-random odds that the killer is in the database. One has nothing to do with the other.

Funny, when I said exactly that in a letter to the LAT, you jumped down my throat.

That’s because you didn’t say exactly that in your letter to the LAT, nor anything close. What you did say was something much sillier, to wit:

The 1-in-3 number does not pertain to the probability that the database search had hit upon an innocent person. Rather, the 1-in-3 number pertains to the probability that a database search will result in a single match — whether that match is to an innocent person or a guilty one.

That is, for want of a better phrase, a load of crap. The 1 in 3 figure pertains to the odds of some innocent person randomly matching the killer’s profile. It tells us nothing about the aggregate odds that there will either be a random match to an innocent (which alone stands a 1 in 3 chance of occurring) or a non-random match to a non-innocent (which runs a 1 in n chance of occurring). All we can say about the combined odds is that if there is any possibility of a true match to the real killer, then the odds of “a single match — whether that match is to an innocent person or a guilty one” are necessarily higher than 1 in 3.
Xrlq (b71926) — 6/19/2008 @ 9:23 am
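The "necessarily higher than 1 in 3" claim in the comment above can be illustrated with a small sketch. It assumes, hypothetically, that the chance of the real donor being in the database (q) is independent of the chance of a coincidental match among the innocents (r), and that a donor who is present always matches. Neither q nor the function name comes from the article; q is unknown in the real case.

```python
def prob_any_match(q, r=1 / 3):
    """Chance of at least one hit, true or coincidental.

    q: hypothetical chance that the real donor is in the database
    r: chance of at least one coincidental match (the thread's "1 in 3")
    Assumes the two events are independent.
    """
    return 1 - (1 - q) * (1 - r)


for q in (0.0, 0.01, 0.1, 0.5):
    print(f"q = {q:.2f}: P(any match) = {prob_any_match(q):.4f}")
```

With q = 0 the result equals r. For any q above zero it is larger, and by how much depends entirely on the unknown q.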
It’s a key assumption for calculating the odds of a hit to an INNOCENT person.

No, it’s not. The odds of a random match to an innocent person are 1 in 3. Period. It makes no (significant) difference whether you run the test on a database consisting of 338,000 innocents or one consisting of “only” 337,999 innocents, plus one guilty guy. Either way, the odds that one of the 337,999+ innocents will randomly match are 1 in 3.

All right. You are correct about this. Remind me never to toss off a quick comment about statistics again. Then some LAT reporter will seize on it as an example of how we all get it wrong from time to time, ergo it’s OK for them to get it wrong in a front-page article.

What I meant, and used inaccurate shorthand for, was this:

If you’re taking a situation where you received only one hit, the most useful thing for the jury to know that can be based on the numbers we know is this: what are the chances that a search of a database of innocent people will result in a hit (which by definition will be a hit to an innocent person)? That’s what the database adjustment is there to illustrate. But the key assumption is that the database consists entirely of innocent people.

I still disagree with your assertion:

The 1-in-3 number does not pertain to the probability that the database search had hit upon an innocent person. Rather, the 1-in-3 number pertains to the probability that a database search will result in a single match — whether that match is to an innocent person or a guilty one.

That is, for want of a better phrase, a load of crap. The 1 in 3 figure pertains to the odds of some innocent person randomly matching the killer’s profile. It tells us nothing about the aggregate odds that there will either be a random match to an innocent (which alone stands a 1 in 3 chance of occurring) or a non-random match to a non-innocent (which runs a 1 in n chance of occurring).

But honestly, I’m tired of arguing it with you. One night I took you through it on the phone, step by step, and it took about an hour, with you fighting the whole way. Finally, you were convinced. Then I got off the phone and you re-convinced yourself otherwise.
Patterico (cb443b) — 6/19/2008 @ 7:15 pm
Hell, I’m a glutton for punishment.

Let’s try it this way.

In our example, with the profile in question occurring in about 1 in 1.1 million people, there are about 6000 people who share that profile. Our database consists of 338,000 people. We have no idea how many of the 6000 people sharing the profile, if any, are in that database.

Let me introduce you to three of those 6000 people.

1) Meet Joe Black. Mr. Black shares the same profile as the DNA donated at the crime scene, but he is not the donor. I.e. Mr. Black is not the real killer.

2) Meet John Killer. Mr. Killer shares the same profile as the DNA donated at the crime scene, but he is not the donor. I.e. Mr. Killer is, ironically, not the real killer.

3) Meet Mr. Orenthal “O.J.” Simpson. Mr. Simpson shares the same profile as the DNA donated at the crime scene, because he is, in fact, the donor. I.e. Mr. O.J. Simpson is the real killer.

Let me tell you a weird fact about the other 5997 people in the world who share the profile. They have names like Mr. 1, Mr. 2, Mr. 3, etc. — all the way up to Mr. 5997.

Oh, and with 6000 people sharing the profile, that leaves around 5,999,994,000 people who don’t share the profile. Let me introduce you to one of them:

4) Meet Ralph Control. Mr. Control doesn’t share the profile and of course is not the real killer.

Now. So I can figure out what the hell you’re talking about: please answer each assertion true or false, based on the given assumption.

For the first four questions, we don’t know who is in the database. We just don’t know.

We don’t know if Mr. Black is there. We don’t know if Mr. Killer is there. We don’t know if Mr. Simpson is there. We don’t know if Mr. 1 is there. We don’t know if Mr. 830 is there. We don’t know if Mr. Control is there.

1) Assumption: we don’t know if Mr. Black is in the database.
Assertion: There is a 1 in 3 chance of a hit to someone in the database.

2) Assumption: we don’t know if Mr. Killer is in the database.
Assertion: There is a 1 in 3 chance of a hit to someone in the database.

3) Assumption: we don’t know if Mr. Simpson is in the database.
Assertion: There is a 1 in 3 chance of a hit to someone in the database.

4) Assumption: we don’t know if Mr. Control is in the database.
Assertion: There is a 1 in 3 chance of a hit to someone in the database.

For the next four questions, I will tell you someone is not in the database. Please answer the assertions true or false.

5) Assumption: we know Mr. Black is not in the database.
Assertion: There is a 1 in 3 chance of a hit to someone in the database.

6) Assumption: we know Mr. Killer is not in the database.
Assertion: There is a 1 in 3 chance of a hit to someone in the database.

7) Assumption: we know Mr. Simpson is not in the database.
Assertion: There is a 1 in 3 chance of a hit to someone in the database.

8) Assumption: we know Mr. Control is not in the database.
Assertion: There is a 1 in 3 chance of a hit to someone in the database.

Maybe I’ll understand what you mean when you have answered those questions.
Patterico (cb443b) — 6/19/2008 @ 11:01 pm
I guess the number 8 followed by a parenthesis converts to some damn smiley-face. Whatever.
Patterico (cb443b) — 6/19/2008 @ 11:03 pm
1) False.
2) False.
3) False.
4) False.
5) False.
6) False.
7) True.
8) False.

In all eight scenarios, there is a 1 in 3 chance of a random match to someone other than Simpson, plus an undefined possibility of a match to Simpson himself. We don’t know how high that possibility is, but we do know that it’s substantially higher for him than for any of the other 5,999 who share his profile. Why? Because we’re looking for a killer in a database of criminals, not a database representing a random cross-section of the general public. You didn’t mention that detail in your example, but as you know, that is a crucial factor in the real world.

Conversely, if your example contemplates a purely hypothetical database in which every individual in the world stands an equal chance of ending up among the few, the proud, the 338,000, then in that case, knowledge of the absence of any particular individual would have a negligible impact on the total odds of a match, and the answer to all 8 assertions would be “true.”

What I meant, and used inaccurate shorthand for, was this:

If you’re taking a situation where you received only one hit,

Careful, Sparky. You just committed the prosecutor’s fallacy yourself. The 1 in 3 figure pertains to the odds that an innocent person will match in any given database search, not to the probability that it did. Before the search, you don’t know if you will receive only one hit, 200 hits, or none.

the most useful thing for the jury to know that can be based on the numbers we know is this: what are the chances that a search of a database of innocent people will result in a hit (which by definition will be a hit to an innocent person)? That’s what the database adjustment is there to illustrate. But the key assumption is that the database consists entirely of innocent people.

Nonsense. The only key assumption is that the database consists of approximately 338,000 innocent people. It makes no difference (or rather, only the most trivial one) whether the database consists of exactly 338,000 innocents or “only” 337,999. Either way, the odds that the database will randomly match to one innocent are 1 in 3.

I guess the number 8 followed by a parenthesis converts to some damn smiley-face. Whatever.

You can fix it by replacing the right paren with the HTML string &#41;. That tells WordPress “yup, I really want you to type ASCII character No. 41, a right paren, and not something else that your dimwit programmers assume I must have meant.” Cf. “copyrighted” statutes, i.e., any that have a subsection (c).
Xrlq (62cad4) — 6/20/2008 @ 4:14 am
Regardless of the details of DNA probabilities, one certainty remains: If the Los Angeles Times is anywhere near a story that has a political issue to exploit, rather than report, you can bet the farm that the Times is guilty of evidence tampering.
C. Norris (ffabe7) — 6/20/2008 @ 9:15 am
C. Norris, got any farms you’d like to bet? I have no doubt that the Times frequently makes errors that just “happen” to favor their political agendas, but this case sure as hell was not one of them. Fixing their only real error (the prosecutor’s fallacy, Item 1 above) would have made the case against Puckett look weaker, not stronger.
Xrlq (b71926) — 6/20/2008 @ 9:51 am
Careful, Sparky. You just committed the prosecutor’s fallacy yourself.

Uh, no, I most certainly did not. Try reading what I wrote again, without jumping in the middle of a sentence to interrupt and misunderstand my point.
Patterico (cb443b) — 6/21/2008 @ 1:07 am
Meet Mr. X(rlq). Mr. X shares the same profile as the DNA donated at the crime scene, but he is not the donor. I.e. Mr. X is not the real killer — in this case.

But Mr. X is a hell of a criminal. Guilty of rapes, murders, kidnappings, robberies, and carjackings, he has been to prison many times.

9) Assumption: we don’t know whether Mr. X is in the database.
Assertion: There is a 1 in 3 chance of a hit to someone in the database.

10) Assumption: we know Mr. X is not in the database.
Assertion: There is a 1 in 3 chance of a hit to someone in the database.

True or false?
Patterico (cb443b) — 6/21/2008 @ 1:14 am
In all eight scenarios, there is a 1 in 3 chance of a random match to someone other than Simpson, plus an undefined possibility of a match to Simpson himself.

Why not just say there is a 1 in 3 chance of a random match?

Is it a) because Simpson is the REAL KILLER?

Or just because b) he’s a criminal and it’s a database of criminals?

This is what questions 9 and 10 are designed to determine.

Because you have continually refused to simply say there is a 1 in 3 chance of a hit, period. You always carve out the real killer, and talk about his odds differently. I always point out that you could do the same with anyone who would match the profile, e.g.: “In all eight scenarios, there is a 1 in 3 chance of a random match to someone other than John Killer, plus an undefined possibility of a match to Killer himself.”

Because I have always, always, always approached this as a pure stats question that doesn’t take into account extraneous factors like the fact that it’s a database of criminals.

Always.

Hell, if you’re going to insist on taking into account Simpson’s status as a killer, then you can do the same for other factors that we don’t know. Does Joe Black live in California or in Siberia? Is John Killer 49 years old or 9 hours old?

Aren’t those questions relevant to a search of a database of California adult criminals?

Why focus only on the status of someone as a criminal, when other factors (like age and geography) also affect their likelihood of being in the database?

Also, you’re making assumptions that the facts don’t cash, like the assumption that the other people are not otherwise criminals. Is John Killer a murderer? I didn’t tell you he wasn’t. I just said he wasn’t the murderer in THIS case.

With a name like Killer . . .

I have ignored such issues and approached it from a purely statistical viewpoint, like this:

Conversely, if your example contemplates a purely hypothetical database in which every individual in the world stands an equal chance of ending up among the few, the proud, the 338,000, then in that case, knowledge of the absence of any particular individual would have a negligible impact on the total odds of a match, and the answer to all 8 assertions would be “true.”

That is how I have always approached the question. As a purely hypothetical statistics question.

Does that bridge the divide we have had for weeks on this issue?
Patterico (cb443b) — 6/21/2008 @ 1:31 am
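The purely hypothetical version of the question (a database drawn uniformly at random from the whole world) can be worked out exactly. This is a sketch under the thread's round numbers: 6 billion people, 6000 of whom share the profile, and a database of 338,000. The function name and the uniform-sampling model are assumptions made for illustration, not a description of the real database.

```python
import math

WORLD = 6_000_000_000   # 6000 sharers + 5,999,994,000 non-sharers, per the thread
SHARERS = 6_000         # people who share the profile
DB_SIZE = 338_000       # database size


def prob_hit(world, sharers, db_size):
    """Chance that a uniformly random database contains at least one sharer.

    Uses the hypergeometric 'no sharer drawn' probability:
    product over i of (world - db_size - i) / (world - i), for i < sharers.
    """
    log_none = sum(math.log1p(-db_size / (world - i)) for i in range(sharers))
    return -math.expm1(log_none)


baseline = prob_hit(WORLD, SHARERS, DB_SIZE)
# Knowing one specific sharer (e.g. Mr. Black) is absent: one fewer eligible sharer.
without_sharer = prob_hit(WORLD - 1, SHARERS - 1, DB_SIZE)
# Knowing one specific non-sharer (Mr. Control) is absent: one fewer eligible person.
without_control = prob_hit(WORLD - 1, SHARERS, DB_SIZE)

print(f"baseline:             {baseline:.6f}")
print(f"one sharer excluded:  {without_sharer:.6f}")
print(f"Mr. Control excluded: {without_control:.6f}")
```

In this uniform model, excluding any single named person changes the chance of a hit only in the fifth decimal place or beyond, which matches the "negligible impact" reading. The model deliberately ignores that the real database is made up of criminals.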
Ok. Let’s start with the database. It is a skewed sample. It is not a representation of the population. It is not composed of 338,000 people randomly selected. It is composed of people who attracted the attention of the authorities. It is a valid sample if you are doing a study on the characteristics of criminal suspects. But unless a study has been done which has calculated how closely the database reflects the general population, it is not useful for the probabilities we are trying to calculate here. We cannot simply assume that those five and a half markers exist as evenly in the database’s population as they do in the general population.

The database, here, is useful as a starting point in an investigation only because we do not have the DNA of every person in the world. It’s the best we have so far. 338,000 people who no longer fly under the DNA radar. No more than that.

Going to the five and a half markers, that 1.1 million “probability” also has to be refined, in the context of a criminal investigation. We know that the victim identified her rapist as, say, a white, fair-haired, thirty-something man. We can eliminate females, all other races, some ethnic groups and men under thirty and over forty. But again, this is for the detectives. For statisticians, even amateur ones such as ourselves, we cannot make calculations unless we know the distribution of those five and a half markers among white, fair-haired, thirty-something men versus their distribution in the population generally.

Does any of this make sense?
nk (d86adb) — 6/21/2008 @ 6:30 am
Careful, Sparky. You just committed the prosecutor’s fallacy yourself.

Uh, no, I most certainly did not. Try reading what I wrote again, without jumping in the middle of a sentence to interrupt and misunderstand my point.

Putting the fallacious clause back into context doesn’t fix anything. You said, and I quote:

If you’re taking a situation where you received only one hit, the most useful thing for the jury to know that can be based on the numbers we know is this: what are the chances that a search of a database of innocent people will result in a hit (which by definition will be a hit to an innocent person)? That’s what the database adjustment is there to illustrate. But the key assumption is that the database consists entirely of innocent people.
[Emphasis added.]

Aside from being wrong about the key assumption, your use of the past tense (“received” rather than “receive”) to describe the original selection odds is a dead giveaway; it is exactly the same error as the Times committed when it said there was a 1 in 3 chance that the database search had matched to an innocent in Puckett’s case rather than saying that there was originally a 1 in 3 chance that it would. Moreover, unlike the Times’s faux division error, this one cannot be chalked up to mere sloppy verbiage; the whole point about ending up with only one match only makes sense when analyzing the data after the fact.

9) False. Based on what you’ve told me about Mr. X, he stands a better than average chance of being in the database.
10) True.

In all eight scenarios, there is a 1 in 3 chance of a random match to someone other than Simpson, plus an undefined possibility of a match to Simpson himself.

Why not just say there is a 1 in 3 chance of a random match?

Because there isn’t just a 1 in 3 chance of a random match. There is a 1 in 3 chance of a random match (i.e., a match to someone other than Simpson or any of his blood relatives), and there is also a wholly unrelated 1 in something chance that Simpson himself will be in there, generating a match of his own. Two unrelated possibilities, best represented by two different, slightly overlapping circles on a Venn diagram. Add in a third circle representing the probability that one of Simpson’s blood relatives will be in there and generate yet another match, and the three circles together represent the probability “that a database search will result in a single match — whether that match is to an innocent person or a guilty one.” The first circle alone gets us the 1 in 3 figure. The non-overlapping portions of the other two get us something in addition to that. We don’t know how much, but we do know it’s something (unless, of course, you are 100% certain that neither O.J. nor any of his blood relatives are in there).

Is it a) because Simpson is the REAL KILLER?

Of course. As you know, the 1 in 1.1 million / 1 in 3 figure represents the odds that 1 record / 1 of 338,000 records will generate a match to someone who is neither the original donor nor anyone related to him by blood. It tells you nothing about the odds that either the original donor or any blood relative (1) is in the database or (2) will generate a separate match if he is.

Or just because b) he’s a criminal and it’s a database of criminals?

That significantly increases the chances that he’ll be in the database. It has no impact on the chances that he’ll generate a match if he is (we assume those chances are 1 or very close to that). More importantly, while it reduces the odds to zero that any other record in the database is from the real killer (because all records are unique), it has no impact on the likelihood that any of the other records are from Simpson’s blood relatives, nor does it have any impact on the likelihood that any of the records pertaining neither to Simpson nor to his relatives (not quite the original 338,000, but damned close) will also generate a random match of its own. Each of those innocent records still carries a 1 in 1.1 million chance of returning a match, so collectively, they still carry a 1 in 3 chance that at least one innocent will return a match.

Because you have continually refused to simply say there is a 1 in 3 chance of a hit, period.

Indeed I have, because it’s not true. We don’t know the likelihood that there will be a hit to an innocent relative or to the original donor, but we do know that either of these things can happen, for reasons having nothing to do with the 1 in 1.1 million / 1 in 3 figure. All that figure assumes is that the database includes approximately 338,000 records from innocent non-relatives. And we know that assumption to be true whether the killer and his relatives are in there or not.

Because I have always, always, always approached this as a pure stats question that doesn’t take into account extraneous factors like the fact that it’s a database of criminals.

Always.

Not sure why you think that consistently assuming a counterfactual is better than occasionally getting the facts right, let alone why you’d accuse the Times of making an error for reporting on the database as it exists in the real world rather than as it would exist in your hypothetical one, but no matter. Let’s assume for argument’s sake that everything we know about the database is wrong, and that it really does consist of 338,000 unique profiles of random individuals across the world. It’s still true that every record from a person unrelated to the killer stands a 1 in 1.1 million chance of returning a match, and that every record from the killer or his blood relative has a higher chance than that. However, the world is awash with people who are not related to the killer, and only a handful exist who are, so the odds of either the killer or his relative being in the database can be essentially eliminated. There is still a 1 in 3 chance of getting a random match to an innocent person who is unrelated to the killer, but there is almost no chance of getting a match to the killer himself, or to any close relatives.

But suppose that by random chance the killer does get in the database somehow. Now what are the chances that the remaining 337,999 records will also generate a match to someone else who is not related to him? Answer: still 1 in 3. To argue otherwise is to commit the gambler’s fallacy, akin to expecting dice to “remember” that you just rolled snake eyes, and therefore assuming that the odds of rolling it a second time are now 1 in 1296 rather than 1 in 36.

Does that bridge the divide we have had for weeks on this issue?

No. The only way to bridge the divide between your perception and reality is for you to admit that as long as we can assume that the database contains approximately 338,000 unique records from innocent individuals who are neither the donor nor his blood relative, the odds that the database search will return a random match to one of them are 1 in 3, without regard to who else might also be in the database or how they would have gotten there. And as long as you cling to your error, you’re really in no position to criticize the Times for failing to correct theirs.
Xrlq (62cad4) — 6/22/2008 @ 8:31 pm
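The dice analogy in the comment above can be verified by brute-force enumeration. The sketch below assumes fair, independent dice and simply counts outcomes. The variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# All 36 outcomes of rolling two fair dice once.
single_roll = list(product(range(1, 7), repeat=2))
SNAKE_EYES = (1, 1)

# All 1296 outcomes of rolling the pair twice in a row.
two_rolls = list(product(single_roll, repeat=2))

first_is_snake = [r for r in two_rolls if r[0] == SNAKE_EYES]
both_are_snake = [r for r in first_is_snake if r[1] == SNAKE_EYES]

p_snake = Fraction(single_roll.count(SNAKE_EYES), len(single_roll))
p_both = Fraction(len(both_are_snake), len(two_rolls))
p_second_given_first = Fraction(len(both_are_snake), len(first_is_snake))

print(p_snake, p_both, p_second_given_first)
```

The chance of two snake eyes in a row is 1 in 1296 before any dice are thrown, but once the first snake eyes has happened, the chance of the next one is back to 1 in 36. The same independence is what keeps the coincidental-match odds for the remaining records unchanged when one record turns out to belong to the killer.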
Aside from being wrong about the key assumption, your use of the past tense (“received” rather than “receive”) to describe the original selection odds is a dead giveaway; it is exactly the same error as the Times committed when it said there was a 1 in 3 chance that the database search had matched to an innocent in Puckett’s case rather than saying that there was originally a 1 in 3 chance that it would. Moreover, unlike the Times’s faux division error, this one cannot be chalked up to mere sloppy verbiage; the whole point about ending up with only one match only makes sense when analyzing the data after the fact.

WRONG.

I’m getting very impatient with you. You’re not trying to understand.

Stop trying to be clever. Stop trying to play gotcha. Clear your mind and listen for a second.

I understand the past tense/future tense issue very well. That is the whole point of my comment. Given that we have received a result — one result — the way to express ANY odds in a way that makes sense to the jury is NOT to express the odds of what actually happened. It is, instead, to express a concept that DID NOT happen: that you started with a database of completely innocent people, in which case you would still have a 1 in 3 chance of a hit.

That is useful information, and it does not stray into the prosecutor’s fallacy.
Patterico (cb443b) — 6/22/2008 @ 9:38 pm
That is the whole point of my comment. Given that we have received a result — one result — the way to express ANY odds in a way that makes sense to the jury is NOT to express the odds of what actually happened. It is, instead, to express a concept that DID NOT happen: that you started with a database of completely innocent people, in which case you would still have a 1 in 3 chance of a hit.

That is useful information, and it does not stray into the prosecutor’s fallacy.

Wrong, and wrong. First, the canard about the 1 in 3 figure assuming a database of exclusively innocent people (as opposed to a database that includes 338,000 records from innocents, and may or may not include anything else) is not useful information. It is nothing more than a figment of your imagination, with no basis in science, math or statistics. The only concept to express is one that DID happen: that you started with a database that included roughly 338,000 innocent people not related to the killer by blood, and therefore had a 1 in 3 chance that one of them would generate a match. This possibility is wholly independent of the 1 in x possibility that there may be a few records from relatives in there, and of the 1 in y possibility that the killer himself is.

Second, any discussion after the search, which takes into account information that could not have been known prior to the search, is necessarily tainted by the prosecutor’s fallacy. Once you argue based on facts known only after the fact, you’ve effectively conceded that the selection odds don’t mean that much anyway, except for the historical value of being able to tell the jury what the original odds were. I suppose you could argue that since we know there was only one match, it makes more sense now to discuss the original selection odds of a single random match (1 in 4) rather than the original selection odds of at least one (1 in 3), but neither figure has anything to do with whether the killer was also in there. That matters only because we now have to compare the original odds of a false match (1 in 3, or 1 in 4 if you prefer) to the odds of a true match (the 1 in n probability that the killer was in there) to determine which of the two is more likely now. That’s the prosecutor’s fallacy problem, which derives from the Times’s use of “did” when they should have said “would.” It has nothing to do with your allegation that the original 1 in 3 figure depended on an assumption which, in fact, was utterly irrelevant.
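
For readers who want to check the "1 in 3" and "1 in 4" figures, here is a minimal sketch of the forward-looking arithmetic. It assumes the numbers quoted in this thread (a 1 in 1.1 million random match probability and roughly 338,000 records from unrelated innocents) and treats each comparison as independent; the function and variable names are illustrative only.

```python
# Assumed inputs, taken from the figures quoted in this thread.
RMP = 1 / 1_100_000      # random match probability for one unrelated innocent
N_RECORDS = 338_000      # records, all treated as unrelated innocents


def match_odds(n_records: int, p: float) -> dict:
    """Forward-looking odds of innocent matches, assuming independent comparisons."""
    p_none = (1 - p) ** n_records
    p_exactly_one = n_records * p * (1 - p) ** (n_records - 1)
    return {
        "expected_matches": n_records * p,   # average number of innocent hits
        "at_least_one": 1 - p_none,          # one or more innocent hits
        "exactly_one": p_exactly_one,        # a single innocent hit
    }


odds = match_odds(N_RECORDS, RMP)
for name, value in odds.items():
    print(f"{name}: {value:.4f} (about 1 in {1 / value:.1f})")
```

Under these assumptions the three quantities come out to roughly 0.31, 0.26, and 0.23, so both "1 in 3" and "1 in 4" are rounded figures. None of them says anything about whether the killer is in the database.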

In other words, if you ever get around to correcting your glaring error about the 1 in 3 figure supposedly depending on the killer being nowhere in the database, rather than merely assuming that the database includes roughly 338,000 individual records from unrelated innocents, then you’re left with only one legitimate complaint about the L.A. Times coverage of the stats, not two. Or, if you prefer, you’re left with two objections to the original article, namely:
1. In presenting the 1 in 3 figure as the odds that an innocent had matched rather than as the original odds that one would, the Times committed the prosecutor’s fallacy.
2. In discussing the 1 in 3 odds that an unrelated innocent would/did match, without comparing it to the odds that a guilty person (or an innocent blood relative) would/did, the Times reached a false conclusion resulting from … the prosecutor’s fallacy.
In other words, one error, which manifested itself in two discrete ways (or perhaps more still, if you dig deeply enough). Not two separate errors, just two shaky conclusions that both follow from a single error. Applying the 1 in 3 figure after the fact was The Times’s error. Tying that figure to the irrelevant question of whether or not the killer is in the database is yours.
Xrlq (62cad4) — 6/23/2008 @ 4:10 am
Wrong, and wrong. First, the canard about the 1 in 3 figure assuming a database of exclusively innocent people (as opposed to a database that includes 338,000 records from innocents, and may or may not include anything else) is not useful information. It is nothing more than a figment of your imagination, with no basis in science, math or statistics.

Since I can’t speak reason to you, I’ll just argue from authority and be done with it. A statistician who was on the panel that made the recommendation for the adjustment e-mailed me this as an example of a non-fallacious statement that would be helpful for a jury:

“If a database the size of the one in Puckett’s case contained DNA samples only from innocent persons, then the chance of a match after a search of that database would be 1 in 3.”

That’s what jurors should be told. I’m done talking to you about it. The end. Good bye.
Patterico (cb443b) — 6/23/2008 @ 6:05 am
No, not the end. The end, at least for now, is that you’re just as wrong now as you were from the start. The reason you “can’t speak reason” to me or anyone else on the issue is that you are clinging to the fundamentally unreasonable position that the presence or absence of the killer’s DNA in one measly record among 338,000 has any meaningful impact on the likelihood that any of the other 337,999 records will return a random match. Your authority, whom you read but obviously did not understand, rightly noted that if you stipulate that the killer (and presumably any blood relative) is not in the database, then you have systematically excluded the possibility of anything other than a random match to an innocent non-relative and therefore, the odds of any match at all and the odds of any match to an innocent non-relative are one and the same. So far, so good. From there, you jumped off a proverbial cliff by concluding that the odds of any match (guilty or innocent) are always 1 in 3, whether the killer is in there or not. You have no authority for that non sequitur. You simply made it up.

Given your increasingly irrational and self-righteous tone (cf. your earlier banter over “my” mental block), I’m beginning to think that at some level you’re more interested in continuing to delude yourself into thinking you are right than you are in actually getting this thing right. I hope I’m wrong about that, but if I am, no need to waste any more time talking “reason” to me. Instead, follow up with your authority and ask him what he thinks the odds are of getting a match to an innocent non-relative if the killer is in the database. If the answer is anything but 1 in 3 or the usual variations thereof (e.g., 1 in 4 to represent the odds of a single match), then I promise to take your place in the naked parade. To avoid any further confusion over who said what and meant what else, try emailing him the following, verbatim:

I understand from our prior correspondence that if a database the size of the one in Puckett’s case contained DNA samples only from innocent persons, then the chance of a match after a search of that database would be 1 in 3. Suppose instead that the database contains one DNA sample from the killer and ten others from his closest blood relatives, while all of the remaining samples are from innocent non-relatives. What are the odds that the database search will return a match to at least one of these innocent non-relatives?

Go ahead, email him that question. Prove me wrong, or admit to having gotten it wrong yourself. If you persist in doing neither, you forfeit any right to criticize the L.A. Times for failing to adequately investigate, acknowledge or correct errors of their own. Blogger, heal thyself.
Xrlq (b71926) — 6/23/2008 @ 7:56 am
There are many, many, many distortions of my opinion here. But the worst is cutting me off in the middle of a goddamned paragraph like a fifth-grader and accusing me of a fallacy, when, if you would get your fingers off the keyboard for one second and THINK and READ before you type, you would see there is no such freaking fallacy.

Here is what I said again:

“If you’re taking a situation where you received only one hit, the most useful thing for the jury to know that can be based on the numbers we know is this: what are the chances that a search of a database of innocent people will result in a hit (which by definition will be a hit to an innocent person)?”

“will result in a hit”

“will result in a hit”

“will result in a hit”

I believe “will” is future tense. As in “will result in a hit”

“will result”

“will”

Future tense.

No. Goddamn. Fallacy.

So why did I say received? To show that in a situation where you received a hit, and only one hit, you run a risk of misleading people if you don’t describe the situation in the way I described.

I agree that there is a roughly 1 in 3 chance of a hit to an innocent person in a database of 338,000 or 337,999. Thus the inclusion of the killer doesn’t really change the forward-looking odds. I’ve actually spent considerable time explaining this to others, so don’t tell me I don’t understand it.

But when you’ve gotten one hit, people are overwhelmingly likely to misread that with the j curtis fallacy of using the forward-looking statistics. So it’s safer to use the “the database is innocent” formulation.

We can’t get any further unless you take back your incorrect assertion that I engaged in the prosecutor’s fallacy. Or we could play the game of stating each other’s position. Because when I feel like I spend 100 percent of my time telling you that I am not arguing what you claim I am, I get frustrated and fly off the handle. Want to try? It would be much, much, much, much more constructive.
Patterico (85d6e8) — 6/23/2008 @ 5:44 pm
I agree that there is a roughly 1 in 3 chance of a hit to an innocent person in a database of 338,000 or 337,999. Thus the inclusion of the killer doesn’t really change the forward-looking odds. I’ve actually spent considerable time explaining this to others, so don’t tell me I don’t understand it.

Good. But frankly, I’m at a loss as to how you can reconcile that with this statement:

The 1-in-3 number does not pertain to the probability that the database search had hit upon an innocent person. Rather, the 1-in-3 number pertains to the probability that a database search will result in a single match — whether that match is to an innocent person or a guilty one.

Maybe I’m just dense, but I’m having a hell of a time understanding how I am supposed to read that last statement as NOT implying that there is only a 1 in 3 chance of a hit to an innocent if the killer is not in the database, and that there is not a 1 in 3 chance of the same if he is. And given the number of other commenters who are convinced that the L.A. Times has willfully misrepresented the stats to make them look worse for the prosecution than they are (when in fact they, if anything, made them look better), I don’t think I’m the only reader who interpreted your statement that way.

But when you’ve gotten one hit, people are overwhelmingly likely to misread that with the j curtis fallacy of using the forward-looking statistics. So it’s safer to use the “the database is innocent” formulation.

Fair enough, but that goes back to the last point of mine that had you so riled up. It is indeed wrong to cite the forward-looking statistics in a context where they’re likely to be misinterpreted after the fact, but if that is the crux of your “whole database is innocent” formulation, isn’t that just another way of saying that the Times committed the prosecutor’s fallacy? After all, the problem with the prosecutor’s fallacy is not with tense per se, but with the fact that once an event has occurred, you now know new information that renders the original forward-looking odds inapposite. If, for example, I told you I was about to flip a coin, you’d rightly guess that the odds of heads vs. tails were 50-50 either way. But if I told you I had flipped a coin, and refused to tell you anything else to make one outcome more likely than the other, you’d be justified in treating the coin toss as though it still had its original 50-50 odds, even though you know at an intellectual level that it really has 100% odds in favor of one and 0% for the other, you just don’t know which.

Thus, I’ll go first in the writing for the enemy game. You have two objections to the original L.A. times article, but both actually boil down to the fact that the authors committed the prosecutor’s fallacy, effectively substituting the original forward-looking odds for odds that take into account what is now known after the fact. Correct?
Xrlq (62cad4) — 6/23/2008 @ 7:46 pm
That’s part of it, but I didn’t think you disagreed with me about that part.

Part of it is that I didn’t commit the prosecutor’s fallacy when you said I did and called me Sparky. Of that I am certain.

Part of it, I am less certain about, but that’s why I want to have the discussion.

I *believe* that one definition of the random match probability is the frequency of occurrence of a profile in a population of unrelated individuals.

The numbers are typically given without reference to whether the killer is in the population. Jurors are not given two sets of stats: set one if the killer is there, and set two if he’s not.

I believe the reason is that the “killer” is simply treated as another person in the population who shares the profile, for purposes of the statistics.

This is why I asked the question above. Because I view the RMP as referring to the frequency of occurrence of a profile in a population of unrelated individuals, I am capable of expressing that probability without reference to any assumptions about whether the killer is in the population — whether that population is a database, or the world population, or a hypothetical group of x quintillion individuals of a particular race (a common way to express the number).

By contrast, you refuse to express the numbers without reference to whether the killer is in the relevant population. This is what has always confused me, because for the statistics — how frequent is this profile in a particular population of unrelated individuals — we don’t have to know whether the killer is in the population or not.

The killer is just another person with the profile.

You could select another person who is not the killer, but shares the profile, and harp on a particular unique characteristic that *that* individual possesses. Say he is the only person with that profile in the world who is named Fred.

If we remove the killer from the population, you agree that (given the numbers in the article) we can say there will be a 1 in 3 chance of a hit.

But what if I hypothesize that one of the 6000 people in the world is named Fred?

It seems to me that, by the logic you have employed in this area for the past several weeks, you would insist on saying:

You can’t say there is a 1 in 3 chance of a hit.

You can only say that there is a 1 in 3 chance of a hit to people who aren’t named Fred, combined with some unknown chance that there will be a hit to Fred, which is a function of the probability that Fred is in the database.

Substitute “the killer” for “Fred” and the logic is exactly the same.

What makes the killer different?

To me, he’s just another guy who shares the profile. Like Fred.

If you don’t agree, then I suspect that one of us is misunderstanding what random match probability means.

Would you agree that I would be right if RMP means the frequency of occurrence of a profile in a population of unrelated individuals?

If you answer yes, then the question is whether I’m right about that.

But please, please, let’s get past the idea that I employed the prosecutor’s (actually technically called the “transposition”) fallacy. You simply stopped reading too soon.

And I get angry about it because I picture Jason Felch reading your comments and smiling to himself, thinking: ha! Patterico committed the same fallacy that he accused me of committing! Xrlq agrees!

And you only said that because you stopped reading too soon and tried to play gotcha instead of trying to understand what I was really trying to say.
Patterico (cb443b) — 6/23/2008 @ 8:07 pm
Part of it is that I didn’t commit the prosecutor’s fallacy when you said I did and called me Sparky. Of that I am certain.

I’ll admit to sloppy language on my part. My point wasn’t really that you had committed the fallacy yourself, but rather, that once you start discussing the propriety of the 1 in 3 figure in terms of facts that could not have been known in advance, you’ve effectively admitted that the real issue is the prosecutor’s fallacy, not the side issue you were purporting to discuss.

The numbers are typically given without reference to whether the killer is in the population. Jurors are not given two sets of stats: set one if the killer is there, and set two if he’s not.

I believe the reason is that the “killer” is simply treated as another person in the population who shares the profile, for purposes of the statistics.

I believe that is incorrect. If I’m not mistaken, the 1 in 3 figure is just the macro version of 1 in 1.1 million figure, which assumes each individual record is not from the killer, nor even from any of his innocent blood relatives. Any records that are from the killer or his close relatives obviously stand a much higher chance than 1 in 1.1 million of returning a match, if indeed they get matched at all. That in turn requires you to know how likely they are to be in the database in the first place, but the 1 in 3 figure doesn’t tell you anything about that likelihood, for the killer, his relatives or anyone else.

I believe that the reason juries are not given two sets of stats is that they are asked to consider how likely an innocent is to match, which is 1 in 3 whether the killer is in there or not, and are NOT asked to speculate as to how likely anyone (guilty or innocent) is to match, as the latter would require them to know how likely the killer and his relatives are to be in the database in the first place.

This is why I asked the question above. Because I view the RMP as referring to the frequency of occurrence of a profile in a population of unrelated individuals, I am capable of expressing that probability without reference to any assumptions about whether the killer is in the population — whether that population is a database, or the world population, or a hypothetical group of x quintillion individuals of a particular race (a common way to express the number).

Right, but the key there is the phrase “population of unrelated individuals.” Meaning not only unrelated to each other, but also unrelated to the killer they’re being matched up against. Any person unrelated to the killer has only a 1 in 1.1 million chance of matching to the killer, while the killer has essentially a 100% chance of matching to himself. They don’t get much different than that.

The killer is just another person with the profile.

In a completely randomized database, yes. But if we had a database like that, a single match would be 6,000 times more likely to be a match to an innocent than to a guilty party. That would be tempered slightly by the fact that the odds were 2-1 against an innocent person being matched, but the result would still be that based on the database search alone, Puckett is 3,000 times more likely to be innocent than guilty. Surely that isn’t your position – is it?

If you don’t agree, then I suspect that one of us is misunderstanding what random match probability means.

I think that’s right. My understanding is that random match probability of 1 in 1.1 million assumes you are comparing DNA samples from two individuals who are not the same person, nor even blood relatives. If you’re matching an individual to himself, or even to his first cousin, the RMP does not come into play. Am I wrong about that?
Xrlq (62cad4) — 6/23/2008 @ 8:56 pm
That would be tempered slightly by the fact that the odds were 2-1 against an innocent person being matched, but the result would still be that based on the database search alone, Puckett is 3,000 times more likely to be innocent than guilty. Surely that isn’t your position – is it?

Purely based on statistics, and not having anything to do with the location of the database, the types of individuals contained in it, etc. — I think the chances that he is guilty are 1 in 6000.

Once you realize it’s a database of California criminals, you can see that the likelihood is much higher — but it becomes hard to quantify with numbers.
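
The thread does not say where the 6,000 figure comes from. One way to arrive at it, offered here purely as an assumption, is to divide a rough 2008 world population by the 1.1 million denominator of the random match probability. The sketch below shows that arithmetic and the resulting "1 in 6,000" chance when nothing else is known about the matched person; the population figure and variable names are hypothetical.

```python
# Assumption: a rough 2008 world population, not a number given in the thread.
WORLD_POPULATION = 6_600_000_000
# "1 in 1.1 million" random match probability, as quoted in the thread.
RMP_DENOMINATOR = 1_100_000

# Expected number of people worldwide who share the profile.
expected_sharers = WORLD_POPULATION / RMP_DENOMINATOR

# If every sharer is equally likely to be the source and nothing else is known,
# the chance that any particular one of them is the source is 1 / expected_sharers.
p_source_no_other_info = 1 / expected_sharers

print(f"expected people sharing the profile: {expected_sharers:.0f}")
print(f"chance a given sharer is the source: 1 in {1 / p_source_no_other_info:.0f}")
```

This is a purely statistical starting point. As noted above, knowing that the database holds California offenders changes the picture in ways that are hard to put a number on.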
Patterico (cb443b) — 6/23/2008 @ 9:03 pm
Fair enough, but that has nothing to do with RMP. All RMP tells us is that if you compare my DNA to yours, there is a 1 in 1.1 million chance that the 5 1/2 indicators in question will match. The RMP doesn’t tell us how likely we are to be related, how likely we are to be the same person, or how likely we are to conduct the study at all. All it tells us is that if we aren’t the same person, and we’re not related, and we do in fact end up comparing the two samples, then there is a 1 in 1.1 million chance of returning a match.

This is, I suspect, the source of the “database of innocents” line. The 1 in 1.1 million figure assumes you are comparing one DNA sample to a single DNA profile from an individual that is not the same person or a blood relative; an “innocent,” if you will. The 1 in 3 figure merely assumes you do the same thing 338,000 times on 338,000 innocents. We can assume that 338,000 unique records will translate into roughly 338,000 innocents, for statistical purposes, because the killer himself can only account for one record, and even an implausibly high estimate of his blood relatives would still be in the noise range. So statistically speaking, we have a database of innocents, whether the guilty man is in there or not. However, that means we can only predict the likelihood of a match to an innocent non-relative, not the likelihood of any match whatsoever. The latter is necessarily greater, as the 1 in 3 figure assumes each record has only a 1 in 1.1 million chance of returning a match, while any records from the killer himself or his relatives would obviously have a higher chance than that.
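
The "noise range" point can be checked numerically. The sketch below assumes the thread's figures (1 in 1.1 million, about 338,000 records) and independent comparisons, then removes a hypothetical number of records belonging to the killer or his relatives to see how much the chance of an innocent match moves. The counts of related records are made up for illustration.

```python
RMP = 1 / 1_100_000      # random match probability, as quoted in the thread
N_RECORDS = 338_000      # approximate database size, as quoted in the thread


def p_innocent_match(n_innocent: int, p: float = RMP) -> float:
    """Chance of at least one match among n_innocent unrelated innocent records."""
    return 1 - (1 - p) ** n_innocent


# Hypothetical numbers of records that belong to the killer or his relatives.
for n_related in (0, 1, 11, 1_000):
    prob = p_innocent_match(N_RECORDS - n_related)
    print(f"{n_related:>5} related records -> {prob:.5f}")
```

Even a thousand related records shifts the result by well under a tenth of a percentage point, which is consistent with treating the database as one of innocents for this calculation.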

Returning to my original objection to your objection #2, it appears to me that before a database is searched, the 1 in 3 figure is a valid indicator of the likelihood that there will be a match to an innocent non-relative – whether the killer is in the database or not. After the search has been conducted, the 1 in 3 figure is not a valid indicator of the likelihood that there was a match to an innocent non-relative – whether the killer was in the database or not. Am I wrong?
Xrlq (62cad4) — 6/24/2008 @ 4:24 am
There is a random, unquantifiable chance that there will be a match to a guilty person in the database in the exact same sense that there was a random, unquantifiable chance that a New York City subway passenger would recognize Willie Sutton from a Post Office “Wanted” poster. Because the database is no different than a book of mugshots in a photo lineup. Maybe the perp is in it, maybe he’s not. It is not designed to be anything else. I don’t see the math. The LAT article only added to the confusion started by the court which allowed the DNA evidence.
nk (11c9c1) — 6/24/2008 @ 7:25 am
I just noticed that I had neglected to respond to this part:

But what if I hypothesize that one of the 6000 people in the world is named Fred?

It seems to me that, by the logic you have employed in this area for the past several weeks, you would insist on saying:

You can’t say there is a 1 in 3 chance of a hit.

You can only say that there is a 1 in 3 chance of a hit to people who aren’t named Fred, combined with some unknown chance that there will be a hit to Fred, which is a function of the probability that Fred is in the database.

Substitute “the killer” for “Fred” and the logic is exactly the same.

What makes the killer different?

The difference is that Fred is a function of the RMP, while the killer is a function of a completely unrelated variable. If Fred is in the database, but is not the killer or a blood relative, then per the RMP, Fred stands the same 1 in 1.1 million odds of returning a match as anyone else. But if the killer is in the database, the RMP doesn’t apply to him, as the RMP only tells us the odds of finding a match between unrelated individuals, not the odds that any given individual will match to himself.
Xrlq (b71926) — 6/24/2008 @ 9:18 am
Here’s why objection (2) is really just another manifestation of the prosecutor’s fallacy. As I think we agree, the original odds of a match to an innocent non-relative were 1 in 3, whether the killer was in the database or not. This is so because the RMP only applies to records of innocent non-relatives, so the 1 in 3 figure is derived by multiplying 1 in 1.1 million times all the records that are not from the killer or his relatives. That subset consists either of the entire database, or such a large majority of it that the 1 in 3 figure will still hold as a rough approximation (which is all it was in the first place). 1 in 3 for innocent non-relative matches as a function of the RMP, that is; not 1 in 3 for matches of any kind.

Once the test has been done, you rightly noted that if there was a single hit only, the 1 in 3 figure no longer applies. I’d state that more generally: once you know how many matches there were, the 1 in 3 figure no longer applies. If there were 0 matches, then the probability of an innocent having matched in this case is now 0. If there were 2 or more matches, then that probability is now 1 (even if there were two killers in fact, there’s still only one “killer” who left the DNA sample at the scene of the crime). If there was exactly 1 match, as in Puckett’s case, the probability depends on which is more probable, a false match, which originally ran a 1 in 3 chance of occurring, or a true one, which originally ran a 1 in n chance. And of course, without knowing the value of n the latter comparison is impossible to make.
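
The single-match case can be sketched with Bayes' rule. The code below is an illustration under stated assumptions, not a claim about Puckett's actual odds: it uses the thread's figures, assumes the killer matches with certainty if he is in the database, ignores relatives, and treats the prior chance that the killer is in the database (the "1 in n" above, called `q_killer_in_db` here) as an unknown input. The values tried for it are hypothetical.

```python
RMP = 1 / 1_100_000      # random match probability, as quoted in the thread
N_RECORDS = 338_000      # approximate database size, as quoted in the thread


def p_innocent_given_one_hit(q_killer_in_db: float,
                             n_records: int = N_RECORDS,
                             p: float = RMP) -> float:
    """P(the lone hit is an innocent | the search returned exactly one hit).

    Assumes the killer, if present, matches with certainty, ignores relatives,
    and approximates the number of innocent records by n_records either way.
    """
    p_zero = (1 - p) ** n_records                        # no innocent matches
    p_one = n_records * p * (1 - p) ** (n_records - 1)   # exactly one innocent match
    false_hit = (1 - q_killer_in_db) * p_one   # killer absent, one innocent matches
    true_hit = q_killer_in_db * p_zero         # killer present, no innocent matches
    return false_hit / (false_hit + true_hit)


# Hypothetical priors for the killer being in the database.
for q in (0.01, 0.1, 0.5, 0.9):
    print(f"prior {q:.2f} -> P(innocent | one hit) = {p_innocent_given_one_hit(q):.3f}")
```

The answer swings from nearly certain innocence to nearly certain guilt depending on the prior, which is why the comparison cannot be made without some estimate of how likely the killer was to be in the database.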

Far from being a separate error from the prosecutor’s fallacy, this is a prime example of why the prosecutor’s fallacy is a fallacy: it’s citing old stats that were based in part on possibilities that now can (and should) be eliminated from consideration.
Xrlq (62cad4) — 7/11/2008 @ 4:24 am

