Patterico's Pontifications

5/7/2008

My Proposed E-Mail to the Authors of the L.A. Times Piece on DNA and Cold Hits

Filed under: Crime,Dog Trainer,General — Patterico @ 6:57 am



It might seem a little odd for me to vet an e-mail I am planning to send by publishing a draft of it on a public website that receives thousands of hits every day. But hey, odd is fun! And so I invite you to read this draft (yet unsent) of a letter to the authors of the recent L.A. Times article on DNA, cold hits, and statistics.

I’d like readers to review it before I send it because I am not a statistics expert, and although I consulted more than one such expert while drafting it, I want to make sure I have made no mathematical or logical misstatements.

Here it is:

Mr. Felch and Ms. Dolan,

After discussions with numerous people with statistical expertise, I am reasonably confident (that is, as confident as a layman like myself can be) that your recent front-page article on DNA cold case statistics gravely misstated the meaning of the math you discuss.

Your article said:

Jurors were not told, however, the statistic that leading scientists consider the most significant: the probability that the database search had hit upon an innocent person.

In Puckett’s case, it was 1 in 3.

I don’t believe the math in question supports the statement that there was a “1 in 3” chance that “the database search had hit upon an innocent person” in selecting Puckett.

The starting point for my analysis was this post by Eugene Volokh, a UCLA law professor and blogger. Prof. Volokh agrees with me that your formulation is wrong. He supports his conclusion with effective reasoning and examples; I commend his post to you. My e-mail to you (which I am blogging on my site) merely expands on Prof. Volokh’s argument as it relates to the article.

(To keep the discussion simple, I will assume there are no issues relating to data corruption or human error. I’ll also stick with the numbers used in your article: a random match probability of 1 in 1.1 million, and a database of 338,000.)

The logic behind the database adjustment was expressed in a report from the National Research Council as follows:

Recommendation 5.1 proposes multiplying the random-match probability (P) by the number of people in the database (N). If the person who left the evidence DNA was not in the database of felons, then the probability that at least one of the profiles in the database would also match the incriminating profile cannot exceed NP.
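
(As a rough check, here is a short Python calculation of my own, using the numbers I mentioned above. It is only a back-of-the-envelope illustration of the NP rule quoted here, and it assumes each profile in the database is an independent random draw with the stated match probability.)

p = 1 / 1_100_000        # random match probability from the article
N = 338_000              # size of the database
print(N * p)             # the NP figure: about 0.307, i.e. roughly 1 in 3
print(1 - (1 - p) ** N)  # chance of at least one match in an all-innocent database: about 0.265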

The clear working assumption here is that the database consists of “innocent” people, none of whom left the DNA found at the crime scene.

This makes sense, at least in a hypothetical case where the jury is informed that the authorities came to suspect the defendant because of a database hit. There is a certain “what are the chances?!” quality to DNA evidence that presumes the defendant was under suspicion before the DNA comparison was done. In other words, if the defendant is before the jury because of a database hit, and the jury knows it, the jury may be “wowed” by the fact of the hit. But the impact of this “wow” factor is considerably lessened if the jury is told that, in a hypothetical search of a database of completely innocent people, there is a 1/3 chance of a hit.

Thus, it seems clear to me that the idea of the adjustment is to communicate to the jury the likelihood of a false positive, based on the assumption that the true donor of the incriminating profile is not in the database.

My understanding is bolstered by an e-mail I received from Prof. David Kaye, who served on the 1996 NRC committee that recommended the adjustment. In that e-mail, Prof. Kaye stated:

[T]he statisticians who favor an adjustment to the random-match probability are considering [the question:] What is the chance that a search of a database will turn up exactly one match when the source of the crime-scene DNA is someone who is unrelated to everyone in the database?

He restated the question in this way:

What is the chance that a database composed entirely of innocent people (with respect to [the] crime being investigated) will show a match?

Note that the fundamental assumption of the hypothetical is that everyone in the database is innocent. Then, and only then, can one use the adjusted figure recommended by the committees as a (very rough) approximation of the chances of a false positive.

If by contrast, you start with the assumption that you don’t know whether the suspect is in the database or not, then the 1/3 number tells you nothing about whether a single hit from the database is a hit to a) the true donor of the incriminating DNA or b) an innocent person who happens to share the same profile (i.e. a “false positive”).

It’s important to keep in mind that what we’re talking about here is the situation where a database search is conducted, and has resulted in only one hit. The question is: what can we say, statistically, about that one hit?

In the case where you don’t know whether the database contains the true donor, or “guilty” person (speaking very loosely), the meaning of a single hit from that database is a function of the likelihood that the true donor is in the database — and (given that only one hit was received) the likelihood that nobody else with that profile is in the database.

If you don’t know whether the true donor (or “guilty” person) is in the database or not, the 1/3 number is merely an expression of the likelihood of a hit — any hit. It’s not an expression of the chances that any resultant hit is a hit to an “innocent” person.
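
To make the distinction concrete, here is a small Python sketch of my own (a deliberate simplification, with the probability that the true donor is in the database treated as an assumed input, because nobody supplies that number):

p, N = 1 / 1_100_000, 338_000
zero_innocent = (1 - p) ** (N - 1)         # no innocent matches (relevant when the true donor is in the database)
one_innocent = N * p * (1 - p) ** (N - 1)  # exactly one innocent match (relevant when the true donor is not in it)
for d in (0.0, 0.1, 0.4, 0.9):             # assumed chance that the true donor is in the database
    innocent = (1 - d) * one_innocent / (d * zero_innocent + (1 - d) * one_innocent)
    print(d, round(innocent, 3))           # approximately 1.0, 0.73, 0.32, and 0.03

Under these illustrative assumptions, the chance that the single hit is to an innocent person swings from a certainty down to a few percent, depending entirely on a number the math itself does not supply. That is why I do not believe the 1/3 figure can be read as the probability that the hit was to an innocent person.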

Again, I am not a statistics expert, and (perhaps as a result) I don’t know whether it is possible to tell juries anything statistically meaningful about the likelihood that the person in front of them is innocent. (Neither does Prof. Volokh, for what it’s worth.) But I feel fairly confident that the 1/3 number is not an expression of the probability that the person sitting in front of jurors is “innocent.”

Thus, I believe that your article is wrong to say, in the statement quoted above, that “the probability that the database search had hit upon an innocent person” in Puckett’s case was “1 in 3.”

That is simply not so, I believe.

If I’m right, I think The Times needs to correct this misimpression. What’s more, I think any correction should be very prominent, given the extreme prominence of the error (or what I believe to be an error) on the front page of the paper’s Sunday edition.

I hope you will see your way clear to discussing these issues with knowledgeable experts. I also hope that you will issue an appropriate and prominent correction if, after reflection and consultation with experts, you believe I have correctly analyzed the issue.

I look forward to your response.

P.S. I should note that my argument does not address the fact that guilt is not automatic once it is determined that the suspect is the donor of the DNA at the crime scene, just as innocence is not automatic once it is determined that he is not the donor. I assume you are aware of the difference between source attribution and guilt, and left out an explanation of the difference for space reasons.

Nor does my argument address the fact that the 1/3 number is an approximation of an approximation. (Prof. Volokh’s post has more details on the relevant statistics.) I also presume you were aware of this, and believe that the 1/3 number is simply a conservative simplification of the more complex equation that Prof. Volokh sets forth in his post.

My argument has nothing to do with these relatively minor quibbles. One could argue that ignoring them is necessary to keep the issue straightforward and simple. My problem is that, these minor issues aside, the way you have expressed the meaning of the adjusted number is (I believe) so misleading as to be fairly termed an error.

Please let me know what you think. I remain humble on the issue because of my lack of expertise in the field.

119 Responses to “My Proposed E-Mail to the Authors of the L.A. Times Piece on DNA and Cold Hits”

  1. It’s a good letter, but I doubt anyone at the Times will read it all the way through. Why read a long letter telling you that you are wrong? They will not care.
    However, points for trying to really present an honest accounting of the numbers.

    DrT (340565)

  2. I like to give people the benefit of the doubt, and I’m hopeful that these reporters are willing to look at this honestly.

    The paper as an *institution* has a very poor track record of correcting errors when the explanation is long and difficult, as I have previously noted. Still, I approach each case with optimism and a hope that the particular individuals I’m dealing with are honorable.

    Patterico (4bda0b)

  3. Or I try to, anyway.

    Patterico (4bda0b)

  4. Pat, clearly the letter exposes the lack of foundation for the seminal point of the article. DNA matching in the database with this number of markers will NOT result in a hit 33% of the time on INNOCENT persons.

    And, as always…great job in issue spotting and for having the courage of conviction in presenting it to them for correction.

    However, I don’t believe you need to apologize for, or even explain away, your personal unfamiliarity with advanced statistics. I have tried numerous cases involving all manner of disciplines (medical issues, advanced financial issues, engineering issues, etc.), and I am not trained as a doctor, economist or engineer.

    I retained experts and sought out their opinions to gain a comfort in that particular area that needed explanation (and translation) to a lay jury.

    Your position as the “issue spotter” does not require that you personally possess the damning information in your background, but that respected and knowledgeable experts have been contacted and they have examined the evidence.

    The impact upon the lay public of this article (which, given the LA Times’ sordid history, seems quite the point) is to suggest that prospective jurors should disregard admissible evidence, or at least give it less weight than it deserves.

    Of course, this type of propagandizing is precisely the point. Don’t give them fodder for using your own words against you, Pat. Take a look at your letter and try, for a moment…to visualize which quotes of yours they would take out of context and print.

    They would call this a “battle of the experts” at best; at worst, they would use your own admissions and juxtapose them against your protestations. The whole impact would be to make your argument look weak…not strong.

    You are in the right here…go after them from a position of strength. Don’t weaken the message with unnecessary apologies for not being a statistics major. You are the attorney who brought forward the experts…it is inherent in that detail that you are not acting as the expert yourself.

    cfbleachers (4040c7)

  5. cfbleachers,

    I may take out the caveats if the draft survives scrutiny of this blog’s readership, as well as that of the knowledgeable people to whom I forwarded it for review.

    Patterico (d39cbe)

  6. I think DrT is correct that your letter won’t be read all the way through, but for a different reason: I think your opening is weak. I would replace

    reasonably confident (that is, as confident as a layman like myself can be)

    with the single word “convinced” (you are convinced, aren’t you?) in order to ‘grab the attention of the reader.’

    I also think there are some similar phrasings that could be changed to reflect what you think more clearly.

    I have one other quibble: I would also replace “likelihood” with “possibility” instead. I think it important to remember that the statistical probabilities being discussed are a model of what we might expect to find in the real world, but we should also remember that the real world doesn’t play by our rules. If it did, there should have been two false positives concomitant with Puckett’s identification in the database, and there weren’t (unless I missed that somewhere?)

    Which is a major reason I find the Times article in error.

    EW1(SG) (84e813)

  7. Pat, I think cfbleachers is absolutely on the mark.

    EW1(SG) (84e813)

  8. I think your math is wrong, Patterico. Also it’s long and hard to read. (I have a lot of chutzpah to say those things, considering that I’m guilty of both in my comments on these threads, but you’re thinking about sending this to the LAT, so I might as well be honest.)

    Daryl Herbert (4ecd4c)

  9. I don’t see the error. They’re posing a hypothetical “what if” question. What if everyone in the database is innocent? What is the likelihood that it would show exactly one positive under those circumstances?

    Gerald A (b9214e)

  10. Gerald, you’re coming a little late to the discussion, see prior posts here, here, here, here, here, and here.

    EW1(SG) (84e813)

  11. My opinion is also that the letter is too long. The key point is very simple: the DNA random match probability cannot be used to compute a probability of guilt without additional debatable assumptions. The quoted section assumes otherwise and is therefore wrong.

    DNA tests serve to filter out innocent people from a pool of suspects, some of whom are guilty and some of whom are innocent, thereby increasing the concentration of guilty people. But clearly the concentration of guilty people after filtering (DNA testing), which corresponds to the probability that somebody in the pool is guilty, depends directly on the concentration of guilty people in the pool before filtering, as well as on how good the filtering is (in other words, how likely the DNA test is to clear a random innocent person). So in order to compute a probability of guilt you need to know the concentration of guilty people before filtering, which in this case means the probability that the database contains the guilty individual. But this is not known, although it probably could be roughly estimated given the results of other cold case searches. That does not appear to have been done, however, and would not give a precise number in any case.
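
    A toy numerical sketch of this filtering picture, in Python (all of the numbers below are made up purely for illustration):

    pool_size = 1000
    guilty_in_pool = 1            # concentration of guilty people before filtering
    clear_rate = 0.999            # chance the DNA test clears a random innocent person
    innocents_passing = (pool_size - guilty_in_pool) * (1 - clear_rate)
    print(guilty_in_pool / (guilty_in_pool + innocents_passing))  # about 0.5; set guilty_in_pool to 0 and it drops to 0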

    Also, while I agree the LA Times was wrong about this, I don’t think it really addresses the main point of the article, which is that the defense appears to have been unfairly prevented from presenting relevant exculpatory information to the jury.

    James B. Shearer (fc887e)

  12. I agree with cfbleachers.

    nk (1e7806)

  13. You would present a more persuasive case if you would show the math in a more straightforward fashion. However, I believe that in doing so you would undermine your position.

    I still maintain that you do not actually grasp the fundamentals of probability. Would you be willing to include the link below from Karl Lembke along with your example from Mr. Volokh?

    http://ritestuff.blogspot.com/2008/05/shades-of-paulos-theres-been-discussion.html#bayes
    .
    But that aside, you would be better served if you could make your case in a more compact manner.

    Amused Observer (c99766)

  14. (Like your proposed letter above, this ignores the possibility of human error/lab error in putting the DB together and collecting samples from people and crime scenes. It also assumes that the DNA at the scene belongs to the perpetrator, which is not always the case.)

    The LAT relied on bad math, and as a result, is publicizing a formula that is wrong, and in many cases (but not the Puckett case) will be unfair to the accused.

    —-

    We know there is only one result returned from the DB. There are two mutually exclusive possibilities: the match returned is innocent, or guilty. We would like to know the probability that he is in fact guilty.

    This depends on the chance that the perpetrator is in the database to begin with: if there is a 0% chance the perp is in the DB, there is a 0% chance the DB will return a guilty suspect.

    For example, suppose you searched a database with DNA from Chinese nationals who had never left their country. It could have 1BN entries, and the chance of any of them being guilty would still be 0%. If the perp is not in the DB, the DB won’t return him.

    Likewise, if you know for a fact that the perp must be in the DB–if you have a DB of every person who ever lived, and you get only a single hit, or if you have a DB of every single person who was within 1 mile of the scene of the crime at the time, and you’re absolutely sure everyone is in the DB–then the “match” must be guilty. If there was only 1 match, and the perp is in the DB, then he’s it.

    If there were two matches, then you would have a situation in which you could have 2 innocents or 1 innocent + 1 guilty. But in this case, there was only 1 match. If the perp is in the DB, and only 1 match is returned, then we know for sure that the perp is the match.

    —-

    Calculating the odds for a 0% chance or a 100% chance is easy. But how do you calculate in between?

    You need to run the numbers without assuming that only 1 hit will be returned from the DB. What is the chance that, if the perp IS in the DB, no innocent matches would be returned? The answer is:

    (1 – match_probability) ^ (size_of_DB – 1)

    Let’s call that figure G, for “chance of guilty match and no innocents, if perp is in the DB”

    G, in this case, is .7354

    Then you want to know, if the perp is not in the DB, what is the chance of an innocent match being returned. The answer is:

    (1 – match_probability) ^ (size_of_DB – 1) * (match_probability) * (size_of_DB)

    Let’s call that figure I, for “chance of exactly 1 innocent match”

    I, in this case, is .2260

    I and G are the two exclusive possibilities that were mentioned at the start. To determine the probability that the match is indeed the perp, the formula is (using Bayesian reasoning):

    (G * P) / ((G * P) + (I * (1 – P)))

    Where P is the likelihood that the perp IS in the DB.

    chance_of_guilty_match = .7354 * P / (.7354 * P + .2260 * (1-P))

    which can be simplified to:

    chance_of_guilty_match = .7354 * P / (.5094 * P + .2260)

    You can see that, when P == 0, the result is 0 (when there is no chance the perp is in the DB, there is no chance it will return a guilty match). Likewise, you can see that when P == 1, the chance of a guilty match is 1.
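
    In Python, a rough sketch of the same formula (G and I are simply the rounded values above):

    G = 0.7354  # guilty match and no innocent matches, if the perp is in the DB
    I = 0.2260  # exactly one innocent match, if the perp is not in the DB

    def chance_of_guilty_match(P):
        # P is the assumed likelihood that the perp is in the DB
        return (G * P) / ((G * P) + (I * (1 - P)))

    print(chance_of_guilty_match(0.5))    # about 0.765
    print(chance_of_guilty_match(0.021))  # about 0.065, the roughly 6.6% figure worked out below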

    —-

    The method the LAT used previously assumes, naively and improperly, that the chance the suspect is in the DB is 50/50–equal odds of being in or out. This is wrong, and in some cases will be unfair to the accused!

    The chance the perp’s DNA is in a database is going to depend on: what is the crime and what is the database? If this was a DNA database of randomly-selected Californians, the chance he would be in is equal to his proportion to the population at large. That’s going to be much less than a 50/50 shot that the perp is in/out of the DB, when the DB only has 338,000 people in it.

    There are about 36M Californians, which roughly means about 16M men of an age who could have committed that crime. There would only be a 2.1% chance that the guilty man was in a DB of 338,000 randomly-selected California men. Using the above formula, the meaning of a “match” would be: a 6.6% chance that the matching suspect is guilty.

    That’s much less than a “1 in 3 chance” that an innocent person was fingered! This is what I meant, when I said in my introduction that in many cases, the formula used by the LAT would be unfair to the accused.

    —-

    But this DB is for sex offenders who have been caught. The chance that the perpetrator of a 30-year-old stranger-rape-murder would be in the DB depends on the chance that he would be caught by now for other crimes. Who is to say what that probability is, that the man who raped and murdered Diana Sylvester is now in that DB? That depends on how likely we think it is that men who rape and murder strangers are to commit further crimes, and then be caught for them. That is simply not a probability that we can know.

    We can guess what percentage of violent sex offenders in California are in the DB:

    If we use 50/50 odds (i.e., we assume half of all violent sex offenders are in the DB, and therefore, there is a 50% chance the perpetrator is in the DB), the chance of an innocent match is 1 in 4.

    If we say 70% of violent sex offenders are in the DB, the chance of an innocent match drops to 1 in 9.

    I don’t know if that figure (% of sex offenders in CA who are in the DB) is the same thing as (% likelihood that the perp of a given crime is in the DB). That depends on how likely we think it is that the person who committed that type of crime would re-offend serially until he is caught. There is an assumption that a man who would rape and murder a woman has probably raped before and will probably rape again. If that assumption is true, it is much more likely that Puckett is guilty than if it is false.

    There is no way, however, when looking at a crime scene to “know” whether it was committed by a typical rapist or if it’s something else. Our stereotypes about how rapists act are based on centuries of police work and academic studies (by criminologists, sociologists, psychologists, etc.). They are probably valid.

    One reason to doubt that the true rapist’s DNA would be in the DB is that the crime took place 30 years ago. The perp could have committed suicide, died of natural causes, or moved out of California in the meantime. How do you factor that into calculating the % chance that the true perp is in the DB? You can’t–not with any precision.

    —-

    I am comfortable, given the facts of this case, that Mr. Puckett was guilty beyond a reasonable doubt.

    If we say there is only a 10% chance the true rapist’s DNA was in the DB (it’s probably much higher), that would mean the odds Mr. Puckett is guilty, based on the DNA alone, would be 27%, or about 1-in-4.

    However, we can multiply that with other independent probabilities based on Bayesian formulas. The evidence against Mr. Puckett is that he is a sex offender, with a distinctive MO, and was in SF at the time and was in the general vicinity of the victim shortly before the time of the crime.

    The fact that he is a sex offender is not an independent probability–that’s why he’s in the DB to begin with! So that cannot be held against him. It would be double counting. But the rest of the evidence is independent.

    What percentage of sex offenders have an MO similar to Puckett’s? Probably fewer than 5% (and that’s high). What percentage of men in California were in the very general vicinity of Ms. Sylvester in the time before the attack? Probably fewer than 5%. (There are .4M men in SF, and as stated above, about 16M men across CA, which means only about 2.5% of Californian men live in all of SF)

    These probabilities can be combined as shown by wikipedia. The result is 100-to-1 odds of guilt.

    1 / (.05 * .05 * 4) == 100

    These probabilities, taken together, even when calculated with the generous “fudge factors” in Mr. Puckett’s favor (the reality is “probably” much more stark), are damning: 100-to-1 odds against innocence. They are damning beyond a reasonable doubt.

    Daryl Herbert (4ecd4c)

  15. Daryl,

    Using your probability progressions in the other thread, I am thinking that were we to find 27 matches of the genetic profile under discussion in the database under discussion, we will have identified the genetic profile for child molesters. What do you think?

    nk (1e7806)

  16. One problem with a lengthy response is that you are likely to make errors yourself. I believe the following is wrong:

    “If by contrast, you start with the assumption that you don’t know whether the suspect is in the database or not, then the 1/3 number tells you nothing about whether a single hit from the database is a hit to a) the true donor of the incriminating DNA or b) an innocent person who happens to share the same profile (i.e. a “false positive”).”

    I think “tells you nothing” is wrong. It would be better for the prosecution if the 1/3 were .0001 and better for the defense if it were .9999. This is because you don’t actually know nothing about how likely the database is to contain the guilty party: you can estimate a plausible range, and this will give a plausible range for the probability of guilt.

    James B. Shearer (fc887e)

  17. Using your probability progressions in the other thread, I am thinking that were we to find 27 matches of the genetic profile under discussion in the database under discussion, we will have identified the genetic profile for child molesters. What do you think?

    I do think, if a certain “profile” occurs much more often than by chance, that it could be a sign that the profile has to do with substantive psychology of the offenders.

    However, there are a few reasons to be suspicious. First, the loci used for the database are currently thought to be “junk DNA” that have no real effect and are almost completely independent of the DNA that does matter.

    Second, it is also possible that many of the profiles would be there because of family connections between the abusers. The profiles are not independent where family is concerned (we use the profiles to establish paternity!). So if there is a case of a father/son who are both rapists (could be genetic reasons, or it could be that the dad and son were both molesting the daughter), that could cause some profiles to show up more than one would expect by random chance, but not be probative of the DNA profile having anything to do with child molestation.

    Also, we don’t know for sure what the incidence is of those profiles “in the wild” (in the population at large). It may just be that those profiles are more popular than previously thought.

    Despite those reservations, I would be surprised if no one has already run the tests you describe on the sex offender DB. At the very least, finding nothing would support our current understanding of DNA profiling.

    Daryl Herbert (4ecd4c)

  18. Delete “(I believe)” from the final paragraph. It’s implied, and detracts from your concluding sentence.

    aunursa (1b5bad)

  19. I think the LAT article, as Patterico quoted above, didn’t state the issue with optimal clarity. But I also don’t think any “lie” was intended, or that the misstatement was completely inaccurate. To support the second point, the old apothegm “never attribute to malice what is adequately explained by error or incompetence” comes to mind. This is a stock refrain applied when addressing government error, but it should be equally applicable to non-government actors.

    For the first point, the accuracy of the LAT’s statement, here is my reasoning:

    If the random match probability is P, and the database DNA samples are statistically distributed so they represent an unbiased sample of DNA from the general population (note that has nothing to do with guilt or innocence, only with the distribution of DNA markers), then P*(Database size) is the probability of at least one “hit”. In this case it is about 1/3.

    So, in the case of ANY same sized set of unbiased samples from the general population (unbiased in the sense of the distribution of DNA markers) there is 1 chance in 3 of yielding a “hit”. This is true whether the “hit” is actually innocent or guilty.

    I see no reason not to tell the jury that. It is simply a fact.

    Patterico quoted the LAT article:

    Jurors were not told, however, the statistic that leading scientists consider the most significant: the probability that the database search had hit upon an innocent person.

    That is a clumsy, but not entirely inaccurate statement of the fact that a search of ANY same sized database (composed of unbiased distribution of DNA markers) would yield a “hit” with the same probability as the search of the particular database that was searched. So, in that sense, it is the approximate probability of an actual innocent being a “hit”.

    That is because there is theoretically an infinite number of possible unbiased sample databases that do NOT contain the DNA of the actually guilty party.

    One problem I see with not revealing that fact of a 1/3 probability of a hit in ANY database to the jury (stated correctly) is that failing to do so invites the jury to apply dependent probabilities as if they were independent, as in People v. Collins.

    The jury is invited to reason:

    1. The probability of this guy turning up is 1 in a million.

    2. The probability that the same guy is alive is xxx.

    3. The probability that the same guy was in the same area as the crime is yyy.

    4. The probability that the same guy would be of the same race as the victim described the perpetrator is zzz.

    5. … more of same.

    6. Multiply the probabilities all together and the probability that the “hit” is not the perpetrator is very tiny. So he must have done it.

    But those are not all independent probabilities.

    I see no technical problem with correcting the LAT’s statement. But I also think the LAT (as quoted here) did state the general gist of the problem correctly, just not the detail. One risk inherent in correcting such errors is that the statement of the correction is likely to put the jury (in this case the public readership) to sleep. That doesn’t mean one should not state the correction, but it does mean that the correction might not be widely accepted for reasons that have nothing to do with its correctness.

    I also think that the length of comments on this issue in these comment threads, by people with far more expertise than I, illustrates the general difficulty with correcting technical errors succinctly. No doubt someone can post at even greater length to correct mine.

    Occasional Reader (94daf6)

  20. If the probability of having 6 of the DNA markers is 1 out of 1.1 million, then the probability of finding an innocent person who has those 6 markers remains at 1 out of 1.1 million.

    Assuming only one person in the pool is guilty and all others are innocent, the probability of selecting an innocent person who also has those 6 markers remains at 1 out of 1.1 million. Assuming everyone in the pool is innocent, the probability of selecting an innocent person who has those 6 markers still remains unchanged.

    Joe - Dallas (652b46)

  21. Occasional reader, I already have. (See comment #12). Succinctly:

    The key to this problem is that you already know there is exactly 1 result returned from the DB, which means you know one of two distinct events took place:

    Guilty guy + no random matches

    OR

    exactly 1 random match in a DB full of innocents

    NOTHING ELSE is possible. So you must compare the relative likelihood of those two events.

    Unfortunately, the relative likelihood of those two events depends on how likely the DB is to contain the guilty guy–which is a figure that is almost impossible to derive, or even approximate.

    Daryl Herbert (4ecd4c)

  22. No, Joe. You just stated the prosecutor’s fallacy.

    nk (1e7806)

  23. We can calculate the true probative value of the DNA by comparing:

    1 – the chance the suspect is guilty, absent all evidence (1 in 16M, because he’s one man out of 16M)

    2 – the chance the suspect is guilty, if all we have is the DNA evidence (use the formula I provided)

    Divide #2 by #1. (Which is to say, multiply #2 by 16M.) That tells you how much of a difference the DNA makes.

    The judge could compare this number to how he thinks the jury will interpret the DNA evidence, to decide if there would be undue prejudice.

    The judge would have to decide what % chance he felt likely that the perp would be in the offender DB. The judge would have to decide this based on testimony from defense and prosecution experts.

    If the judge decided there was a 10% likelihood the true perp’s DNA was in the DB, for example, the calculations would be as follows:

    Prior to considering DNA evidence, and all other evidence, the chance of guilt is 1 in 16M (real perp was 1 out of all the men in California)

    And after DNA evidence, the chance of guilt was 1 in 4.

    Then the true value of DNA evidence is that it changes the odds by 4M:1.
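
    (The same arithmetic as a quick Python check, using the 10% assumption above:)

    prior = 1 / 16_000_000    # chance of guilt before considering any evidence
    posterior = 1 / 4         # rough chance of guilt after the DNA evidence, per the formula above
    print(posterior / prior)  # 4,000,000: the 4M:1 shift described above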

    A probative value of 4M:1 is astronomically high. It would have to have extreme prejudice in order to justify keeping it out.

    The only thing I can think of that has enough prejudice to keep pace with that kind of probative value is the prosecutor’s fallacy.

    The solution is not to keep the DNA out, but to issue a limiting instruction preventing the prosecutor from misleading the jury with the prosecutor’s fallacy. Judges need to get a grip on the PF, and need to keep it out of their courtrooms.

    This is not to say that the jury should be told the 4M:1 figure. They would likely misinterpret it. It is only so that the judge can decide if allowing DNA evidence in would be more prejudicial than it is truly probative.

    Daryl Herbert (4ecd4c)

  24. There’s a mistake in my post #23.

    4M:1 is not the probative value of the DNA alone, it’s the probative value of the DNA + the fact that the suspect is in the DB (i.e., the fact that he’s a sex offender)

    I regret the error.

    Daryl Herbert (4ecd4c)

  25. 21

    “Unfortunately, the relative likelihood of those two events depends on how likely the DB is to contain the guilty guy–which is a figure that is almost impossible to derive, or even approximate.”

    I disagree with the approximate part. People run these searches all the time. You can look at how often they get hits to estimate the probability of getting a hit. Suppose for example 300 searches out of 1000 produced hits. Now some of those hits are false, but we have an estimate for the probability of a false hit (usually computed in a conservative way, so it is more like an upper bound on the probability of a false hit). For the case at issue this probability was about .3 (summed over the entire database). However this was unusually high because the sample was bad and you could only search on some of the markers. With a good sample a more typical probability of a false hit might be more like .0003 (again summed over the entire database, meaning a random match probability more like one in a billion instead of one in a million).

    So let us assume the estimated number of false hits for all 1000 searches is between 0 and 3. Then the number of true hits would be between 297 and 300, so we could estimate the probability of a true hit as about .3, which is also the probability that the database contains the guilty guy (assuming no errors in handling the DNA). This estimate is subject to statistical error arising from the finite (i.e., 1000) size of the sample of searches. It is also subject to bias if our search is not typical of the searches in our sample. You can try to adjust for this, for example by looking at the success rate for searches on old crimes vs. searches on recent crimes, or LA crimes vs. SF crimes. As long as the crime we are searching is not wildly different from previous crimes searched for which the results are known, we should be able to roughly estimate the probability of a true hit.
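
    In Python, a toy version of that estimate (the counts are the hypothetical ones above, not real data):

    searches = 1000
    hits = 300                  # searches that produced a hit
    expected_false_hits = 1.5   # somewhere between the 0 and 3 assumed above
    print((hits - expected_false_hits) / searches)  # about 0.3: estimated chance a database of this kind contains the true donor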

    James B. Shearer (fc887e)

  26. No I did not state the prosecutor’s fallacy. The chance of a hit with those markers remains at 1 out of 1.1m. Of those individuals with a hit, the chance of being innocent may be 1 of 3, but the first hurdle of having those six markers is still at 1 of 1.1m.

    The probability of having the 6 markers and being innocent (having met both criteria instead of the single criterion of being innocent) is closer to 1 out of 1.1m divided by (.667) (2 out of 3).

    Look at it with the red bean test: in a jar with 1 red bean and 9 white beans, the chance of pulling the red bean out of the jar is 1 of 10. In a second jar with 1 red bean and 9 white beans, the probability of pulling the red bean is 10%. The chance of pulling the red bean in both jars is slightly less than 1%. The probability of having a hit (both red beans) and picking the innocent red bean is approximately 1/2 of 1%. Again, the test is meeting both criteria, not the single criterion, as if they are not interdependent.

    Joe - Dallas (652b46)

  27. G, in this case, is .7354

    Then you want to know, if the perp is not in the DB, what is the chance of an innocent match being returned. The answer is:

    (1 – match_probability) ^ (size_of_DB – 1) * (match_probability) * (size_of_DB)

    Let’s call that figure I, for “chance of exactly 1 innocent match”

    I, in this case, is .2260

    Uhhhmmm, sorry Daryl, but it seems to me these two probabilities should sum to one, and they don’t, so your calculation is off. You ignore that there is also a chance of two matches, three matches and so on. While these higher numbers of matches are unlikely, their probability is not zero, as seen by the rather large discrepancy between your .2260 and the actual probability of 0.2645. Next time try using the binomial distribution and using matches >= 1.
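
    For what it is worth, here are those binomial figures spelled out in Python (using the numbers from the article; the difference between an exponent of N and N-1 is negligible here):

    from math import comb
    p, N = 1 / 1_100_000, 338_000
    p0 = comb(N, 0) * (1 - p) ** N                 # no matches: about 0.7354
    p1 = comb(N, 1) * p * (1 - p) ** (N - 1)       # exactly one match: about 0.2260
    p2 = comb(N, 2) * p ** 2 * (1 - p) ** (N - 2)  # exactly two matches: about 0.0347
    print(p0, p1, p2, 1 - p0)                      # 1 - p0 is one or more matches: about 0.2645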

    Where P is the likelihood that the perp IS in the DB.

    chance_of_guilty_match = .7354 * P / (.7354 * P + .2260 * (1-P))

    which can be simplified to:

    chance_of_guilty_match = .7354 * P / (.5094 * P + .2260)

    You can see that, when P == 0, the result is 0 (when there is no chance the perp is in the DB, there is no chance it will return a guilty match). Likewise, you can see that when P == 1, the chance of a guilty match is 1.

    You don’t usually want to set your prior probability (of guilt in this case) to either 0 or 1. In Bayesian circles such priors are considered dogmatic, in that no amount of evidence to the contrary will shift you away from that prior. Further, P is actually the prior probability of guilt, not whether or not the perp is in the database. That is, you are interested in the following:

    Prob(Guilt|DNA Match)

    using Bayes theorem this becomes,

    Prob(Guilt|DNA Match) = Prob(DNA Match|Guilt)*Prob(Guilt)/Prob(DNA Match).

    The Prob(DNA Match) is the 0.2646 I noted above. Prob(Guilt) is the prior probability of guilt prior to observing the DNA match.

    That is the proper and correct way to use Bayes theorem, not your version.

    What is critical here is that Prob(Guilt) isn’t obvious. Should we use 1/population, 1/(population in a given radius), or some other number?

    Based on the following formulation:

    Prob(DNA Match|Guilt) = 0.9999999
    Prob(Guilt) = 1/10
    Prob(DNA Match) = 0.2646

    and running it through Bayes theorem the Probability of guilt is only 0.3779.

    Running the same numbers for the following priors yields the following results:

    Prob(G) = 0.5 => Prob(G|DNA Match) = 1
    Prob(G) = 1/3 => Prob(G|DNA Match) = 1
    Prob(G) = 0.25 => Prob(G|DNA Match) = 0.945
    Prob(G) = 0.2 => Prob(G|DNA Match) = 0.756
    Prob(G) = 1/6 => Prob(G|DNA Match) = 0.63

    So even for fairly “high” priors such as 0.2 and 1/6 there is more than sufficient “reasonable doubt”.

    Don’t send the letter Patterico, you are wrong on this one.

    Steve Verdon (94c667)

  28. One more important point on the prior probability (density): you should specify it PRIOR to seeing the data, hence the name. Once you obtain some data you update that prior via Bayes theorem to obtain what is generally called a “posterior” or “posterior distribution”. If you are in a situation where you can get new additional data, this posterior becomes your new prior and you do the process again. Hence Bayes theorem allows one to learn.

    Some might object to the prior probability as well, but in cases where you will be getting new data and you will be getting such data in “sufficient” quantities it will swamp anything but a dogmatic prior (which I’ve warned Daryl about using).

    How all this would work in a legal/court setting I don’t know, as I am not a lawyer. But my post, #27, is how you would use Bayes theorem in this case to try and figure out the guilt of a person. I think prosecutors should be aware of it and should consider using it in trying to determine whether or not to bring charges. As for using it in court, such as with a jury…again, I don’t know, as I am not a lawyer. Having a mathematical formula for determining guilt is not something I’m completely comfortable with. When putting something into numbers like was done here, by people who know quite a bit about the subject relative to the jurors…well, I can see it having undue influence on jurors.

    Steve Verdon (94c667)

  29. Steve: they are not supposed to sum to 1.0, specifically because I did account for the possibility of getting 2 innocent matches, 3 innocent matches, etc.

    Let me use a bean example to show why it doesn’t have to sum to 1: I have two jars of beans. Each jar has 10 beans in it. Jar A has 1 red bean only. Jar B has 5 red beans.

    If I tell you I picked from one jar, and I got a red bean, what is the probability that I got the red bean from Jar A?

    chance of getting red bean if I picked from Jar A: 1/10

    chance of getting red bean if I picked from Jar B: 5/10

    So you might say, there’s a 1/6 chance that I got it from Jar A. But that would be wrong. That’s only true if I randomly choose between the Jars, without favoring either one.

    Assume I randomly choose between the Jars, and then randomly choose a bean (and I can’t tell the difference between red and not-red beans while I’m picking), and let P represent the chance that I choose from Jar B.

    The probability I will get a red bean is:

    p x (5/10) + (1-p) x (1/10)

    The probability, if I have a red bean, that I picked from Jar A as opposed to Jar B, is:

    (1/10 * (1-p)) / ((1/10 * (1-p)) + (5/10 * p))

    You can see that if p == 0, this resolves to 1. That is correct (if I never choose Jar B, and I have a red bean, I am absolutely sure the red bean came from Jar A)

    Also if p == 1, this resolves to 0. If I never choose from Jar A, and I have a red bean, you can be sure it came from Jar B.

    If p == 1/2, this resolves to 1/6. This is the expected result.

    This equation is correct.

    If you add up 1/10 and 5/10, you only get 6/10. It doesn’t add up to 1. It’s not supposed to. We are looking for the relative probability between two exclusive events.

    In fact, if I have 7 red beans in Jar A, and 8 red beans in Jar B, the sum would be 1.5, which is well over one. That’s not a problem.
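
    A quick Monte Carlo check of the jar example (a rough sketch; p is fixed at 1/2 here):

    import random
    random.seed(0)
    red_from_a, red_total = 0, 0
    for _ in range(100_000):
        picked_a = random.random() < 0.5                       # p == 1/2: choose either jar with equal chance
        is_red = random.random() < (0.1 if picked_a else 0.5)  # 1 red of 10 in Jar A, 5 of 10 in Jar B
        if is_red:
            red_total += 1
            red_from_a += picked_a
    print(red_from_a / red_total)  # close to 1/6, matching the formula above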

    Daryl Herbert (4ecd4c)

  30. Steve: they are not supposed to sum to 1.0, specifically because I did account for the possibility of getting 2 innocent matches, 3 innocent matches, etc.

    There is no logical reason not to include 2, 3 or more matches. Your calculation is incomplete.

    Let me use a bean example to show why it doesn’t have to sum to 1: I have two jars of beans. Each jar has 10 beans in it. Jar A has 1 red bean only. Jar B has 5 red beans.

    You are confused here, in that your example is not analogous. We are selecting from only one database, not multiple databases.

    So you might say, there’s a 1/6 chance that I got it from Jar A. But that would be wrong. That’s only true if I randomly choose between the Jars, without favoring either one.

    This actually deals with the prior probability not the probabilities of picking a red bean. You are really confused.

    You can see that if p == 0, this resolves to 1. That is correct (if I never choose Jar B, and I have a red bean, I am absolutely sure the red bean came from Jar A)

    You shouldn’t set a prior equal to 1 without damned good reason because then you have a dogmatic prior and no amount of evidence will shift you away from the hypothesis that the prior is concerned with.

    Your analysis is seriously flawed and stems from a poor grasp of probability theory and Bayesian analysis in specific.

    In any event, even if we go with your lower probability of 0.226 there is still more than sufficient reason to doubt that the hit was for the right man. Unless you set your prior to 1 or a really high number. And keep in mind, given how you have formulated the issue,

    Prob(Guilt|DNA Match)

    The prior probability is Prob(Guilt), not the probability of being in the database.

    Finally, I think one could argue that the investigators should have had a prior of 1/338,000 (or even larger) given that they had no other reason to suspect anyone in the database. Given this, the probability of guilt given a DNA match is only 0.00000131 (even using your number). Now, using that as your new prior, you could update with the information that Puckett had committed rapes before and had been in San Francisco around the time of the rape. But you should also factor in that Puckett’s MO doesn’t match (i.e., he never killed future victims). I don’t see that kind of evidence raising that probability all that much; even an order of magnitude increase would leave you with a probability that is amazingly un-impressive.

    Steve Verdon (94c667)

  31. Daryl is right except for giving the wrong value of G. In this case G is the probability of no matches to innocent people in the data base or 2 or more matches to innocent people in the data base. It should be 0.7740 not 0.7354. This does not make a great difference to his calculated probabilities.

    The formula that the LA Times gives is wrong because it does not take account of the probability of the database containing the guilty person. The true figure could be higher or lower than this. However, the main point of the article is that simply quoting the random match probability is highly misleading in this case: one has to take account of the chance of getting a false positive in a database search. This point is correct.

    One can put a minimum value on the probability that the data base contains the actual culprit. This will be the coverage that it would have if it was collected from the entire relevant population at random. In an old case like this it is hard to say what this relevant population is, so one could be cautious and say the entire national population that could conceivably have done it. In fact this probability will be higher than that, but how much higher?

    Now if we had undegraded DNA the chances of a match to someone else might be so low that we can exclude the possibility of a false match. But we don’t in this case.

    The LA Times figures are certainly far closer to the real probabilities than those quoted by the prosecution. Yes, point out that they need to know the probability of the data base including the suspect. But their point is that the random match probability is misleading when you are looking at the results of a data base search. This point is correct and if you write a letter you should acknowledge this.

    Lloyd Flack (ddd1ac)

  32. No, Daryl got that right. That probability is the probability of no hits on the database, not of guilt though. Obviously if that is the probability of no hits, then 1 – 0.7354 is the probability of 1 or more hits. There is no logical reason to exclude 2, 3 or more hits so that is the number one should go with when wondering what is the chance of at least one false positive. Granted, getting 2 false positives is highly unlikely, but it could still happen.

    Steve Verdon (94c667)

  33. Daryl,

    Here is another reason your red bean example doesn’t help you on the issue of number of matches. In your example you ask me which jar you picked from. That isn’t the same type of question as “what is the probability of getting 1 or more hits?” An analogous question would be: what are the chances that I get 1 or more red beans? Since you are, apparently, making one draw in your bean jars example, we can simply set the probability of more than one red bean equal to zero.

    A question analogous to “which jar did I pick from” to the Puckett case is, “What are the chances he is guilty?”

    Steve Verdon (94c667)

  34. Steve, Daryl is answering Patterico’s question, which is “Given that there is one hit on the data base, what is the probability that it is the culprit?” You are right in saying that he is using the wrong value of G, but two or more hits on innocent people go into the same pot as no hits on innocent people.

    Lloyd Flack (ddd1ac)

  35. Lloyd,

    That isn’t the question in the article though. The question in the article is about false positives. While 1 is the most likely, 2, 3, or more are also possible and shouldn’t be ignored.

    In fact, if you have two hits, then you really have a problem. For example if you got two sex offenders for Puckett’s case then it must be the case that one of them is innocent. The chance of that happening with the numbers in the Puckett case is surprisingly large, IMO….3.47%.

    The prosecution made a mistake in the Puckett case. Hopefully it will be tossed on appeal.

    Steve Verdon (4c0bd6)

  36. So did the judge by the way in not allowing the relevant information to go before the jury.

    Steve Verdon (4c0bd6)

  37. Steve & Daryl – I’ve been reading your exchanges and can follow your disagreements, but I don’t have the mathematical background to ascertain which of you is correct.

    Which places me in the same set as most prospective jurors.

    What bothers me regarding this Times article is a clouding of the overall value of DNA used as evidence. If most laymen cannot judge the veracity of the arguments from experts, there’s a real possibility due to this article that DNA evidence as a whole could be judged “inaccurate” or “misleading”, which would be inaccurate and misleading in and of itself. I know Patterico has specific gripes regarding the information presented, but my problem with the article is that it implies that you can’t trust what you’re being told by the prosecution regarding DNA evidence. It’s understandable, but untrue. If you can’t tell the difference between mathematical formulas regarding probity, how can you tell the difference between 5 and 13 loci matches, something that nobody here’s arguing?

    Apogee (366e8b)

  38. Steve & Daryl – I’ve been reading your exchanges and can follow your disagreements, but I don’t have the mathematical background to ascertain which of you is correct.

    Which places me in the same set as most prospective jurors.

    What I worry about regarding this Times article is a clouding of the overall value of DNA used as evidence. If most laymen cannot judge the veracity of the arguments from experts, there’s a real possibility due to this article that DNA evidence could be judged “inaccurate” or “misleading”, which would be inaccurate and misleading in and of itself. I know Patterico has specific gripes regarding the information presented, but my problem with the article is that it implies that you can’t trust what you’re being told by the prosecution regarding DNA evidence. It’s understandable, but untrue. If you can’t tell the difference between mathematical formulas regarding probity, how can you tell the difference between 5 and 13 loci matches, something that nobody here’s arguing?

    Apogee (366e8b)

  39. Ecch. Double post. sorry.

    Apogee (366e8b)

  40. Steve, I know my beans example isn’t a perfect analogy. (To make a perfect analogy, it would have to be as complicated as the original.) I just used it to show that the two figures did not need to add up to 1, and to illustrate how to use the formula to calculate relative probabilities.

    Steve wrote: There is no logical reason to exclude 2, 3 or more hits

    There is a very good logical reason to exclude them: we know that the DB returned exactly one hit. That precludes the possibility of two or more innocent matches.

    We are asking: what is the significance of this DB returning exactly one hit?

    And the question then becomes: which DB did we look in? Did we look in the database that has the guilty suspect, or did we look in the DB that does not have the guilty suspect?

    In that sense, we are trying to resolve the same question (which jar did this bean come from). The jar with the perp has a 73.5% chance of giving a “red bean.” The jar without the perp has a 22.6% chance of giving a “red bean.” We have a red bean.

    Now we need to know: which jar did we pull it out of? What is the probability that we pulled it out of the perp jar?

    Daryl Herbert (4ecd4c)

  41. A question analogous to “which jar did I pick from” to the Puckett case is, “What are the chances he is guilty?”

    No. If there is a 50% chance that you picked from the DB with the perp in it, that does not mean there is a 50% chance that he is guilty.

    In fact, there would be a 76.5% chance of guilt (a 1/4 chance that the person is innocent).

    Daryl Herbert (4ecd4c)

  42. To get the correct probability of a false positive we have to look at both the effect on the false positive rate of looking through a large data base and the probability of the culprit being in the data base. The LA Times looked at the first but not the second. The prosecution looked at neither.

    Given any realistic data base coverages the probability that the LA times gives of wrongly identifying an innocent suspect is of the right order of magnitude. The figure given by the prosecutor is about five orders of magnitude too low.

    There are errors in the LA Times article but the primary point is correct. A data base search with degraded DNA might only give probable cause rather than proof beyond reasonable doubt. The random match probability quoted by the prosecution is misleading in this case.

    Lloyd Flack (0c6a49)

  43. While 1 is the most likely, 2, 3, or more are also possible and shouldn’t be ignored.

    Of course it should be! We’re not talking about how many hits the database could have generated in Puckett’s case, but the number that it did. That number was 1. Not “probably 1, but possibly 2, 3 or more.” Just 1. Given that, as Daryl rightly noted above, there are only two possibilities:

    1. The killer wasn’t in the database. This was that 1 time out of 3 where the database randomly matches someone, and that poor schmuck just happened to be Puckett.
    2. The killer was in the database, and it was Puckett. This was that 2 times out of 3 where the database doesn’t randomly match anyone.

    Thus, the odds of Puckett being a false hit are the combined odds of both (1) The Real Killer not being in the database and (2) some random unlucky bastard getting a match. The only way to combine those two possibilities and still end up with the original 1 in 3 figure is to assume that the probability of the killer not being in the database was 1. And if you assume that, you’ve acquitted Puckett (who was in the database) right off the bat.

    Xrlq (62cad4)

  44. 43

    “Thus, the odds of Puckett being a false hit are the combined odds of both (1) The Real Killer not being in the database and (2) some random unlucky bastard getting a match. The only way to combine those two possibilities and still end up with the original 1 in 3 figure is to assume that the probability of the killer not being in the database was 1. And if you assume that, you’ve acquitted Puckett (who was in the database) right off the bat.”

    This is not correct. If you assume initially there was a 2/5 chance the database contained the killer and that the expected number of false hits was 1/3 then a single hit has about a 1/3 chance of being innocent.

    James B. Shearer (fc887e)

  45. Actually, if your assumptions are right the odds are worse than that. Suppose that the odds of exactly one false match really are 1 in 3, and the odds of the killer being in the database really are only 2 in 5. In that case, the odds of the two possible scenarios are as follows:

    Puckett innocent = 3/5 (killer not in DB) * 1/3 (one false match) = 3/15.

    Puckett guilty: 2/5 (killer in DB) * 2/3 (no false match) = 4/15.

    Granted, the odds of either scenario panning out were “only” 7/15, but never mind that; the other 8 scenarios resulted in either 0 or 2+ hits, and can therefore be ruled out. So we’re left with two possibilities, one of which had “only” a 3/15 chance of materializing, the other, a whopping 4/15. Yikes.

    Xrlq (62cad4)

  46. James, you are correct.

    One one end, if there was a 0% chance of the guilty person in the DB, then there is a 0% chance that the resulting match is guilty.

    On the other end, if there was a 100% chance of the guilty person being in the DB, then there is a 100% chance that the resulting match is guilty.

    Between the two points (0,0) and (1,1) is a strictly increasing curve. A value of 1/3 is necessarily going to be achieved somewhere between 0 and 1.

    Daryl Herbert (4ecd4c)

  47. Xrlq, if James was using my numbers (that is to say, the correct numbers, with a 22.6% chance of exactly one random match) then the probability of guilt is 68%, assuming a 40% chance the killer is in the database, which means there’s a 1-in-3 chance of innocence.

    Granted, the odds of either scenario panning out were “only” 7/15

    No. You don’t add the probabilities together to find out how likely it is that it would be one or the other.

    In fact, if one or both probabilities are above 50%, you can end up with a sum greater than 100%. (See my bean example in comment #29.) Their sum is essentially meaningless.

    Daryl Herbert (4ecd4c)

  48. This is not correct. If you assume initially there was a 2/5 chance the database contained the killer and that the expected number of false hits was 1/3 then a single hit has about a 1/3 chance of being innocent.

    Pardon my naive question. If you assume initially that there was a 2/5 chance the database contained the killer and you got a single hit, isn’t that all you need to know to determine that the single hit has a 2/5 chance of being guilty? And therefore a 3/5 chance of being innocent?

    Maybe this is one of those fallacies that we statistics non-experts fall into.

    Patterico (4bda0b)

  49. It probably is, but at the end of a long day I’m not getting it.

    Patterico (4bda0b)

  50. If you get a single hit it can come about in two ways. It can be because the data base contains the killer and there were no false positives, or it can be because the data base does not contain the killer and there is exactly one false positive. Given that you have one hit, the probability that it is the killer is (Probability killer in DB and 0 false positives) / ((Probability killer in DB and 0 false positives) + (Probability killer not in DB and 1 false positive))

    Say the probability of the killer being in the DB is 2/5, the probability of 0 false positives is 2/3, the probability of 1 false positive is 1/4, and 2 or more false positives is 1/12.

    If you have 1 hit in this case then the probability that it is the killer is (2/5 x 2/3)/ (2/5 x 2/3) + (3/5 x 1/4)) = 16/25

    Lloyd Flack (0c6a49)

  51. Ooops, typo left bracket out.

    (2/5 x 2/3)/ ((2/5 x 2/3) + (3/5 x 1/4)) = 16/25
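
    (If anyone wants to check the arithmetic, here is a small Python sketch, purely illustrative and using the same made-up numbers as above rather than the Puckett figures, that reproduces the 16/25.)

    p_killer_in_db = 2/5   # assumed chance the data base contains the killer
    p_zero_fp = 2/3        # assumed probability of zero false positives
    p_one_fp = 1/4         # assumed probability of exactly one false positive

    # P(the single hit is the killer | exactly one hit)
    p_guilty = (p_killer_in_db * p_zero_fp) / (
        p_killer_in_db * p_zero_fp + (1 - p_killer_in_db) * p_one_fp)
    print(p_guilty)  # 0.64, i.e. 16/25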

    Lloyd Flack (0c6a49)

  52. The problem is that the 2/5 chance of the killer being in the database only applies before you run the test. Once you run the test, some of those other 3/5 scenarios are now off the table. All you can do is examine the odds of all possible outcomes, then eliminate the ones that have been ruled out, and compare what’s left over.

    Taking Daryl’s numbers, a 40% chance of killer in DB and a 22.6% chance of 1 false hit (and, I presume, a 33.33% chance of one or more false hits) mean before you run the test, there is a:

    1. 40% x 66.67% = 26.7% chance that the killer was in the database and there were no false hits
    2. 60% x 22.6% = 13.6% chance that the killer was not in the database and there was exactly one false hit
    3. 59.7% chance that something else happens.

    Once you run the test and come back with exactly one hit, you can eliminate option #3, since any combination other than #1 or #2 would have resulted either in 0 matches or more than 1. So you’re left with two explanations of the data, one of which was roughly twice as likely to occur as the other.

    Bottom line: everything hinges on how likely the killer is to be in the database. If the 40% estimate is right, then so was the L.A. Times’s conclusion. And if it is anywhere close, every criminal conviction based on a DNA match alone has reasonable doubt written all over it.

    Xrlq (62cad4)

  53. #48 Patterico – you’re right about that. But the value of the single hit depends entirely on the probability of the real killer’s profile being in the database – which is indefinite; certainly not 999,999 out of 1,000,000.

    Also consider – all but one of the persons in the database are innocent regardless. The chance of some other person matching is the same, regardless.

    Rich Rostrom (7c21fc)

  54. No. You don’t add the probabilities together to find out how likely it is that it would be one or the other.

    You do, though, if you are attempting to illustrate the odds that either one or the other occurs, which was my point. The easiest way to compute the odds of exactly one match materializing is to compute the odds of each of the two scenarios that would produce exactly one match, then add them together.

    Xrlq (62cad4)

  55. Pardon my naive question. If you assume initially that there was a 2/5 chance the database contained the killer and you got a single hit, isn’t that all you need to know to determine that the single hit has a 2/5 chance of being guilty? And therefore a 3/5 chance of being innocent?

    No.

    The thing is, the DB with the culprit in it is about 3x as likely to return just one hit as the DB without the culprit in it. (73.5% vs. 22.6%.)

    That’s why you can assume, knowing only that you got a single hit, and that there was a 40% chance the DB had the culprit in it, that there is a 68% chance that your hit is the culprit.

    73.5*40 / (73.5*40 + 22.6*60) == 68%

    —-

    Here’s a simpler analogy to illustrate it:

    Assume that if the DB is innocent-only there is a 1% chance of returning a false positive.

    Assume that if the DB has the culprit, there is a 99% chance of returning the true positive only.

    If the odds are 50/50 that the culprit is in the DB, then here are the possible outcomes:

    GUILTY IN DB:
    0: 0%
    1: 99%
    2: 1%

    NOT IN DB:
    0: 99%
    1: 1%
    2: 0%

    EXPECTED OVERALL:
    0: 49.5%
    1: 50%
    2: .5%

    EXPECTED OVERALL, W/ BREAKDOWN:
    0: 49.5%
    1G: 49.5%
    1I: .5%
    2: .5%

    1G means 1 person is fingered, and he’s guilty.
    1I means 1 person is fingered, and he’s innocent.

    So 50% of the time, you would get exactly 1 hit back from the databases, as opposed to 2 hits or no hits.

    When the DB returns exactly one result (which happens 50% of the time), 99 times out of 100, you got there because the guilty DB fingered the guilty suspect.

    That is to say, exactly one “hit” would mean a 99% chance that the “hit” is guilty, even though there was only a 50% chance he would be in the DB to begin with.

    Because false positives are much rarer than true positives in that example (it’s not always so), even though there’s only a 50/50 chance the suspect will be in the DB, when the DB returns a “positive” (exactly 1 result) there is a 99% chance that that positive is a true positive. False positives are so much rarer, that when we have a positive, there is a 99% chance that it’s a true positive.

    That is what we are really measuring: when we have a positive, what is the chance that it will be a true positive.

    When only 1 result is returned, here’s the breakdown:

    1G: 49.5%
    1I: .5%

    This doesn’t add up to 100, so you need to sum them to get the denominator, and then put one of them on top.

    49.5 / (49.5 + .5) == .99

    This will tell you the percent chance, that if you have exactly 1 result returned by the DB, that that result is a true positive.
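
    (If it helps, here is the same toy example in a few lines of Python; the 50/50 prior and the 99%/1% figures are just the assumptions of the analogy, not the Puckett numbers.)

    p_in_db = 0.5        # assumed chance the culprit is in the DB
    p_hit_if_in = 0.99   # chance of exactly one (true) hit if he is in
    p_fp_if_out = 0.01   # chance of exactly one (false) hit if he is not

    p_one_guilty = p_in_db * p_hit_if_in            # 0.495 (the "1G" row)
    p_one_innocent = (1 - p_in_db) * p_fp_if_out    # 0.005 (the "1I" row)

    # Given exactly one hit, the chance that it is a true positive
    print(p_one_guilty / (p_one_guilty + p_one_innocent))  # 0.99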

    Daryl Herbert (4ecd4c)

  56. And if it is anywhere close, every criminal conviction based on a DNA match alone has reasonable doubt written all over it.

    Which is the result that worried me in #38. All DNA matches are not equal, but this story is selling that idea.

    Apogee (366e8b)

  57. One theory we should be able to put to bed is the notion that matches become less reliable as the database grows. All other things being equal, a larger database increases the odds of a false hit, but also increases the odds of the killer being in there.

    Xrlq (62cad4)

  58. Xrlq: The easiest way to compute the odds of exactly one match materializing is to compute the odds of each of the two scenarios that would produce exactly one match, then add them together.

    I apologize; you are correct.

    I misread your numbers as not being weighted for the likelihood that the killer is/is not in the DB. Once you weight them for that, you can add them together, to get the likelihood that exactly 1 match will be the result.

    Daryl Herbert (4ecd4c)

  59. By Bayes’ Theorem
    Probability killer in data base given exactly one match
    = Probability killer in data base and one match / probability of one match

    = probability killer in data base and no false positives / probability of one match

    Lloyd Flack (0c6a49)

  60. 48 49

    No it isn’t. If the expected number of false hits is 1/3 as we are assuming then the chance that a database containing the killer will return a single hit is greater than the chance that a database not containing the killer will return a single hit. In fact it is about 3 times as likely. This means the odds ratio that the data base contains the killer increases by a factor of 3. The odds ratio is the probability the data base contains the killer divided by the probability that it doesn’t. Here this is initially (2/5)/(3/5)=2/3. Increasing this by a factor of 3 means the odds ratio becomes 2 which corresponds to a 2/3 probability the database contains the killer (since (2/3)/(1/3)=2).

    A simple example which illustrates the principle is suppose we have a large number, n, of suspects in a large number of cases. Some of the suspects say (2/5)*n are in fact guilty, the remaining (3/5)*n are innocent. Suppose we have a test which eliminates on average 2/3 of the innocents but none of the guilty. We apply the test. Now we have (2/5)*n guilty suspects, (1/5)*n innocent suspects and (2/5)*n exonerated suspects. So the fraction of guilty in the remaining pool of suspects has increased to 2/3 as some of the innocents have been filtered out.

    The actual case is more complicated as some of the guilty suspects are filtered out as well (when you get multiple hits) but the chance a guilty suspect remains (given a single hit) is about 3 times the chance an innocent suspect remains so the odds ratio still increases by a factor of 3 and the chance of guilt goes to 2/3. Here of course the suspect is the entire database and a suspect is guilty if the database contains the killer.
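
    (The same odds-ratio bookkeeping as a short Python sketch, using the assumed 2/5 prior and an expected 1/3 false hits, with the Poisson approximation for the false-positive counts.)

    import math

    prior = 2/5     # assumed chance the data base contains the killer
    lam = 1/3       # assumed expected number of false hits

    p_zero_fp = math.exp(-lam)        # Poisson P(0 false positives)
    p_one_fp = lam * math.exp(-lam)   # Poisson P(exactly 1 false positive)

    likelihood_ratio = p_zero_fp / p_one_fp                      # exactly 3 here (1/lam)
    posterior_odds = (prior / (1 - prior)) * likelihood_ratio    # (2/3) * 3 = 2
    print(posterior_odds / (1 + posterior_odds))                 # 2/3 chance the single hit is the killer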

    James B. Shearer (fc887e)

  61. I left out

    probability of one match =

    probability killer in database and no false positives + probability killer not in data base and one false positive

    Lloyd Flack (0c6a49)

  62. One theory we should be able to put to bed is the notion that matches become less reliable as the database grows. All other things being equal, a larger database increases the odds of a false hit, but also increases the odds of the killer being in there.

    If the database was much bigger, the probabilities could reverse (it would be more likely to have 1 single false positive than to have 0 false positives)

    If your task is catching crooks, a bigger DB is better. But if what you want is:

    “Given the DB spitting out exactly one match, I want good odds that he’s guilty”

    Then a bigger DB may be counter-productive. (The fact is, this task only interests us because we got exactly 1 match and we are working backwards. If it was normal to get 3-10 matches, we would be doing a different math problem. Just because a bigger DB is likely to confront us with different math problems doesn’t mean that it’s worse.)

    If there were 2M people in the DB, there is a 16% chance of getting 0 false positives and a 30% chance of getting 1 false positive. That would mean, if you thought there was a 50/50 chance the killer was in the DB, and exactly one match was returned, there would only be a 1/3 chance that he’s guilty.

    You would have to raise the odds that the killer was in the DB from 15% to 50% in order to keep pace at this particular task, when you increased the DB’s size from 338,000 to 2M.
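
    (A rough Python sketch of where those figures come from, using the Poisson approximation and the 1-in-1.1 million random match probability; the 2M database and the 50/50 prior are just the hypotheticals above.)

    import math

    def fp_probs(db_size, rmp=1/1.1e6):
        # Poisson approximation: P(0) and P(exactly 1) false positives
        lam = db_size * rmp
        return math.exp(-lam), lam * math.exp(-lam)

    for n in (338_000, 2_000_000):
        print(n, fp_probs(n))   # roughly (0.735, 0.226) and (0.16, 0.30)

    # Posterior guilt with a 50/50 prior and exactly one hit, for the 2M database
    p0, p1 = fp_probs(2_000_000)
    print(0.5 * p0 / (0.5 * p0 + 0.5 * p1))   # roughly 1/3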

    Daryl Herbert (4ecd4c)

  63. 57

    “One theory we should be able to put to bed is the notion that matches become less reliable as the database grows. All other things being equal, a larger database increases the odds of a false hit, but also increases the odds of the killer being in there.”

    Yes but databases generally get larger by adding people less likely to be guilty. So as you add people to the database the probability the last person added will generate a true hit is going down while the probability that the last person added will generate a false hit is remaining constant. So the fraction of false hits in the pool of all hits is increasing. In other words false hits become more of a problem when searching large pools of unlikely suspects as compared to small pools of likely suspects.

    James B. Shearer (fc887e)

  64. Am I the only reader who finds this thread reminiscent of this one?

    Xrlq (62cad4)

  65. Oh, and I was tired and made a slip. Daryl’s figure for G in #14 is correct.

    Lloyd Flack (0c6a49)

  66. Xrlq: Bottom line: everything hinges on how likely the killer is to be in the database. . . . And if it is anywhere close, every criminal conviction based on a DNA match alone has reasonable doubt written all over it.

    This is a special case. It was not based on a “full” DNA match. It only involved 5.5 locations, because the DNA sample wasn’t in great condition, presumably after 30-odd years. There was a 1-in-1.1M chance that a person matched.

    In cases with a “full” match, the odds of an individual being an accidental match are lower than 1-in-1 billion. Usually even more zeroes than that.

    A pure DNA cold hit, based on a “full” 23 loci matching up, is powerful evidence. It blows everything away. (I put “full” in quotes because AFAIK there is no real reason scientists couldn’t eventually use 50 or 60 loci, if they really wanted to; they just currently don’t feel the need, because 23 usually get the job done.)

    The chance of a 338,000-person DB returning any false positives with 1-in-1BN is less than .0338%. It just about never happens. A cold hit on 23 loci is killer. No reasonable doubt.*

    * with the caveat that the 1-in-1BN odds only apply to strangers, and not to family members. Family members are much more likely to have similar DNA. They are still very unlikely to have the exact same 23 loci, but it might be enough to create reasonable doubt. The solution is to DNA test close family members of the accused, in order to knock out that possibility.

    Daryl Herbert (4ecd4c)

  67. If the database was much bigger, the probabilities could reverse (it would be more likely to have 1 single false positive than to have 0 false positives)

    Only if you changed the criteria for deciding who gets in the database and who doesn’t. I’m not advocating that, just observing that as long as the criteria remain as they are, a larger database increases the odds of a false hit and the odds of the killer being in there uniformly. Unless it systematically increases one relative to the other, the overall odds should remain the same, no?

    You would have to raise the odds that the killer was in the DB from 15% to 50% in order to keep pace at this particular task, when you increased the DB’s size from 338,000 to 2M.

    My point exactly. Unless you change the criteria by which people are added to the DB (in which case all bets would be off), shouldn’t sextupling the database also sextuple the odds that the killer is in there (while also sextupling the expected number of false hits, but having a more modest impact on the odds that exactly 1 false hit will occur)?

    Xrlq (62cad4)

  68. 66

    “The chance of a 338,000-person DB returning any false positives with 1-in-1BN is less than .0338%. It just about never happens. A cold hit on 23 loci is killer. No reasonable doubt.*”

    No, it depends on how likely the person was to be the killer absent the DNA match. The match increases the odds ratio by one billion, but this may be insufficient for proof beyond a reasonable doubt if the other evidence says the odds are very low. For example if the suspect was a passenger on the space shuttle at the time of the crime. Of course there could still be an elaborate conspiracy whereby the suspect was guilty, but the a priori odds of that are likely less than 1 in a billion, leaving reasonable doubt even after the DNA hit. Of course this sort of thing doesn’t happen very often, but if for example you ran 1000 searches against a 1,000,000-name data base you could expect 1 false hit even with a 1 in a billion random match probability. So a hit shouldn’t be an automatic conviction when there is strong evidence the other way.

    James B. Shearer (fc887e)

  69. The probability of exactly 1 false positive will be at its maximum when the expected number of false positives is 1. In this case it will be approximately 1/e, which is approximately 0.3679. I have used the Poisson approximation to the Binomial distribution, which will be very close in this case.
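
    (For anyone checking: under the Poisson approximation, P(exactly 1) = lambda * exp(-lambda), which peaks at lambda = 1.)

    import math
    lam = 1.0                     # expected number of false positives
    print(lam * math.exp(-lam))   # 0.3679..., i.e. 1/e, the maximum over all lam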

    Lloyd Flack (0c6a49)

  70. Yikes! 69 comments!
    At the risk of repeating a point raised in that welter, I wade in…

    The pertinent question in all this really amounts to, what do we know, and when do we know it?

    On day 1, we have a crime scene and a DNA sample. It’s degraded, so we may not get a perfect match. But we search a convenient database using what we have.

    Now the country of Elbonia has started taking DNA samples of all its citizens the moment they’re born, in case they ever have to identify a kidnap victim. It happens there are 338,000 entries in the Elbonian National Database (END).

    You run the search. Because of the quality of the sample, the odds of finding a DNA match are one in 1.1 million. What are the odds that you will get at least one match at that level in the END?

    The answer is, about one in three, according to the formula in the Times; one in four according to my calculation using the Poisson distribution.
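
    (Two lines of Python show both figures: N*P, which is essentially the formula in the Times, versus the Poisson chance of at least one match. The database size and match probability are the ones from the article.)

    import math

    N, P = 338_000, 1/1.1e6
    print(N * P)                  # ~0.307, the Times' "about 1 in 3"
    print(1 - math.exp(-N * P))   # ~0.264, chance of at least one match, about 1 in 4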

    Now, this is the probability, in the absence of any other information. Other information may cause us to exclude any matches in this database. For example, maybe Elbonians can’t bear to leave their mud weasels behind. Or maybe all the individuals in the database were born within the last decade.

    But the math remains the same. Pick a random database of the same size, and you have a significant chance of getting a hit with that DNA sample, by chance.

    That fact is true, and I suspect the Times will feel quite justified in sticking with it.

    Karl Lembke (7910b8)

  71. The other issue that keeps coming up appears to be a confusion between an a priori assessment of guilt vs. an a posteriori assessment.

    At the time the search was run against the criminal database, there was no a priori reason to believe the perpetrator was in that database. Only after a partial match came through was there an a posteriori reason to investigate further. This investigation turned up evidence which may or may not have been enough to convict. We may be about to find out.

    But the match itself is likely enough, in a database of that many individuals, that it’s not only not terribly compelling, it’s not even terribly surprising. We’d expect an average of one such match in any four databases of that size. If not the Elbonian National Database, then in the Kneebonian, Shinbonian, or Thighbonian database. (Fans of Dilbert will recognize that I’m being humerus.)

    If police had developed leads, identified Mr. Puckett as a person of interest, turned up the other evidence against him, and then tested his DNA against the sample from the crime scene, the a priori 1.1 million-to-one odds of a match would have been much more impressive. Since the testing happened in a different order, the a posteriori one in three or four odds of a match are nowhere near as compelling.

    Karl Lembke (dc2e4d)

  72. Karl #70 – Pick a random database of the same size, and you have a significant chance of getting a hit with that DNA sample, by chance.

    Yes, but will the Times also stress your earlier statement Because of the quality of the sample, the odds of finding a DNA match are one in 1.1 million? Because that’s my question. It isn’t that the Times has the math wrong in this instance, it’s the presentation of this specific mathematical conclusion as some secret that reveals something about the accuracy of DNA samples in general. That misrepresentation is what this whole site is about. This isn’t important because of the specific math being accurate or not. It’s important due to the overall inaccuracy of the implication. For me, the math on this specific point is the baited hook that leads the reader in a false direction, which is why I don’t consider it as important.

    Apogee (366e8b)

  73. Ok, that and I don’t have the math chops to evaluate it.

    Apogee (366e8b)

  74. Cripes, but there is a serious misunderstanding of how Bayes theorem works.

    1. Bayes theorem can be stated as:

    Prob(A|B) = Prob(B|A)*Prob(A)/Prob(B)

    Can we all agree that this is a mathematical and logical fact?

    The question Daryl posed was:

    What is the probability that a person is guilty given that there is a DNA match?

    If that is the case the probability in question then is,

    Prob(G|DNA).

    Where G = guilt and DNA = DNA Match.

    Now we apply Bayes theorem above. As such, Daryl’s issue with who is and who is not in the data base is just not an issue. It is not part of the problem. The parts of the problem are:

    Prob(DNA|G)
    Prob(G)
    Prob(DNA)

    There is nothing there about the probability of so-and-so being in the database. That part of the discussion is simply not relevant.

    There is a very good logical reason to exclude them: we know that the DB returned exactly one hit. That precludes the possibility of two or more innocent matches.

    Two points.

    1. Yes, in this case it returned one hit. But Bayes theorem uses P(DNA) not P(DNA=1).

    2. Even if we use DNA = 1, i.e., one match, the probability is still high enough that with most priors the probability that Puckett, or anyone facing a similar set of numbers, is guilty, absent other evidence, is exceedingly small.

    So you are arguing over how many zeros are to the right of the decimal place. I shall grant your argument and take one away. Still, that order of magnitude improvement in terms of guilt still leaves a very very small probability of guilt.

    Xlrq,

    Thus, the odds of Puckett being a false hit are the combined odds of both (1) The Real Killer not being in the database and (2) some random unlucky bastard getting a match. The only way to combine those two possibilities and still end up with the original 1 in 3 figure is to assume that the probability of the killer not being in the database was 1.

    No, we don’t know a priori who is in that database in terms of answering the question. Hence Daryl’s setup is all wrong. In fact, in answering Daryl’s question and using Bayes theorem, which you must, the issue of who is in the database isn’t part of the problem.

    I have set out the Bayesian formulation, you can work through the numbers on your own, but your reasoning is quite incorrect.

    Loyd,

    Probability killer in data base given exactly one match
    = Probability killer in data base and one match / probability of one match

    = probability killer in data base and no false positives / probability of one match

    No. Lets set up the probability of interest here, once again a conditional probability.

    Prob(KIB|DNA=1)

    KIB = Killer in the Database
    DNA=1: there is one DNA Match.

    Using Bayes theorem we now get:

    Prob(KIB|DNA=1) = Prob(DNA=1|KIB)Prob(KIB)/Prob(DNA=1)

    This is slightly different from your formulation, in which you had

    Prob(KIB|DNA=1) = Prob(KIB|DNA=1)Prob(KIB)/Prob(DNA=1)

    Which is not the correct formulation of Bayes theorem. Also, I’m not sure that

    Prob(KIB|NFP) = Prob(KIB|DNA=1)

    where NFP = No False Positives.

    Xrlq,

    Bottom line: everything hinges on how likely the killer is to be in the database. . . . And if it is anywhere close, every criminal conviction based on a DNA match alone has reasonable doubt written all over it.

    No. We have this problem even if the killer is not in the data base. Lloyd’s 0.754 number and the 0.246 are what you get via a vanilla application of the binomial distribution to this problem. That is, given any set of 338,000 DNA samples and chances of finding a match of 1 in 1.1 million, you have a 0.246 chance of finding a match. That is, suppose we have two such databases with no overlap. Further, suppose one of the databases has the killer in it. Now, it is quite possible we could get a hit off of both databases. Unless I am mistaken the chances of this happening are 0.246. In fact, we could get several hits from each database. Your insistence that the issue of who is and who is not in the database is not really relevant.

    Also, as to the general point about the size of the database: if you increase the size and hold all other factors constant, you will get more false positives. For example, using the Puckett numbers, if we increase the size of the database by a factor of 5 the probability of getting 2 hits goes from 3% to 25.39%, about an 8-fold increase. So can we stop this nonsense about the size of the database, mmmkay? Sheesh.
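
    (A quick Poisson sketch of that last claim; with my rounding it comes out to roughly 3.5% and 25.4%, the same order of change.)

    import math

    def poisson_pmf(k, lam):
        return lam**k * math.exp(-lam) / math.factorial(k)

    lam = 338_000 / 1.1e6            # expected false positives, current database size
    print(poisson_pmf(2, lam))       # ~0.035, chance of exactly 2 hits
    print(poisson_pmf(2, 5 * lam))   # ~0.254, chance of exactly 2 hits at 5x the size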

    Oh, and even Prof. Kaye agrees on this last point.

    Steve Verdon (284bc5)

  75. Your insistence that the issue of who is and who is not in the database is not really relevant.

    Au contraire, it is not only relevant but crucial to the analysis. Consider these two hypos:

    1. We are 100% certain that the killer is in the database, and therefore will yield a match. We also know there’s a roughly 1 in 3 chance that someone other than the killer will get a false match. Then we query the database, and got exactly one hit. Guilty or not guilty?
    2. Same as #1, only this time we’re absolutely certain that the killer is not in the database. We still know that there’s a 1 in 3 chance of matching someone else, so we run the test for grits and shins, and lo and behold, Steve Verdon is a match. Guilty or not guilty?

    I don’t see how it’s possible to intelligently discuss the odds of a false match without comparing them to the odds of a true match.

    Xrlq (62cad4)

  76. If police had developed leads, identified Mr. Puckett as a person of interest, turned up the other evidence against him, and then tested his DNA against the sample from the crime scene, the a priori 1.1 million-to-one odds of a match would have been much more impressive. Since the testing happened in a different order, the a posteriori one in three or four odds of a match are nowhere near as compelling.

    Perhaps you meant to state this a different way, but the order in which an investigation develops its leads is less important than the quality of the leads themselves.

    For instance, a composite sketch via eyewitnesses may give a broader physical description than 5 or 5 1/2 DNA markers would. In fact, many times…MUCH broader. Yet, starting from this piece of information, other leads are developed through trained investigative techniques and compiling forensic evidence.

    You narrow down potential suspects as you sift through mountains of leads.

    If a match comes up in a database of known offenders and you work your way through a mountain of leads…and each lead continues to point toward that individual…making a stronger and stronger case…why would it matter that you hit on the DNA markers first, second, eighth, twentieth or 12758th in your order of leads?

    When a victim or eyewitness is shown photographs of page after page after page of individuals, sometimes there is a “hit” because the photograph resembles the perp…and an investigation into the whereabouts of that identified person, as well as other evidence linking them to the crime or not, ensues.

    The development of leads may come in any order.

    Matching 5 or 6 DNA markers is a great starting point, a great mid-point and a great end point in the compilation of forensic evidence.

    The fact that only ONE match “hit” in a database of several hundred thousand KNOWN violent crime offenders…seems to indicate that the commonality of persons holding these 5-6 markers is not that frequent…a priori.

    The fact that this “hit” also involved a person who was in the area at the time of the crime, also increases the evidence and has probative value.

    As you increase the number of markers, you reduce the number of people sharing those particular genetic traits…but that is based upon a universe too large for criminal forensics to be meaningful.

    Little children, toddlers, advanced elderly, in some instances gender (rape/semen)…winnow down the universe of DNA “false” hits or matches from a forensic standpoint.

    A statistical “hit” on that person…is no “hit” at all. In fact, the mathematical chance of hitting randomly on ANYONE is 1 in 1.1 million, but the chance of randomly hitting on someone with opportunity is MUCH smaller (all persons containing these 5 1/2 markers, who were in the area of the crime on the date and time in question). Statistically speaking, the chance of a “false” hit COMBINING those two facts…would reduce randomness to perhaps 1 in 50 million.

    Add to that, the chance of a known felon, who has previously committed similar crimes, who was in the area on the date and time in question, who also has the size, strength, gender (rape/semen) to commit the crime, having the same exact 5 1/2 markers as found at the crime scene, and you reduce your universe drastically and your chance of a false hit to 1 in 100 million. Or more.

    In summary, you not only have to determine mathematically how many people out of 300 million would statistically share those 5 1/2 markers…you would then have to eliminate from that universe how many people are known to be outside the CRIME universe in the first place. A 98 year old man and an 8 year old girl do not belong in the universe from a crime solving standpoint.

    cfbleachers (4040c7)

  77. Steve Verdon
    We don’t need to use Bayes theorem. Simply applying the definition of conditional probability will be sufficient. Also the fact that there was one and only one match against the data base is relevant information that we can use.

    P(G | DNA = 1) = P (G & DNA = 1) / P(DNA = 1)

    where P(G) is the probability of guilt and P(DNA = k) is the probability of k hits on the data base.

    Now P(G | DNA=1) = P(KIB | DNA=1)

    =P(KIB & DNA=1) / P(DNA = 1)

    where P(KIB) is the probability that the killer is in the data base. If there was one hit and the killer was in the data base, the suspect is guilty, and of course there was also one hit on the data base. The suspect could have been guilty and there could have been more than one hit, but there wasn’t in this case.

    Now P(KIB & DNA = 1) = P(KIB & FP =0)

    =P(KIB) x P(FP = 0)
    where P(FP = m) is the probability of getting m false positives.

    If the killer is in the data base and there is only one hit, then it has to be the killer and there must be zero false positives. Since the chance of the killer being in the data base is independent of the chance of a given number of false positives, the probability of both happening is the product of the two probabilities.

    Now for there to be one data base hit either the killer is in the data base and there are no false positives or the killer is not in the data base and there is one false positive.

    P(DNA = 1) = P(KIB & DNA =1) + P(KNIB & DNA=1)
    = P(KIB & FP=0) + P(KNIB & FP=1)
    = P(KIB) x P( FP=0) + P(KNIB) x P(FP=1)
    = P(KIB) x P( FP=0) + (1-P(KIB)) x P(FP=1)

    Therefore P(G | DNA =1) = P(KIB) x P(FP = 0) / (P(KIB) x P(FP = 0) + (1 – P(KIB)) x P(FP = 1))

    We can reformulate it using Bayes Theorem to come up with an alternative but equivalent expression. There is no need to do so.

    Once we work through the calculations with any likely data base coverage we get probabilities of guilt that are certainly probable cause but by themselves are certainly not proof beyond reasonable doubt. The LA Times article quotes a juror as saying that they placed a lot of weight on the quoted probability and might well have given a different verdict if allowance was made for how the hit was obtained.
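
    (The final formula is easy to put into a few lines of Python, purely as a sketch; the Poisson figures for zero and one false positives use the article’s 338,000 database and 1-in-1.1 million random match probability, and the data base coverage plugged in is only an example.)

    import math

    lam = 338_000 / 1.1e6               # expected number of false positives
    p_fp0 = math.exp(-lam)              # P(FP = 0), about 0.735
    p_fp1 = lam * math.exp(-lam)        # P(FP = 1), about 0.226

    p_kib = 0.4                         # example data base coverage, P(KIB)
    p_guilt = p_kib * p_fp0 / (p_kib * p_fp0 + (1 - p_kib) * p_fp1)
    print(p_guilt)                      # about 0.68 with this example coverage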

    Lloyd Flack (ddd1ac)

  78. Your insistence that the issue of who is and who is not in the database is not really relevant.

    The issue of the probability that the true donor is in the database is everything, because we’re dealing with a case where we got only one hit.

    It helps when you confront the situation we’re actually discussing.

    You, Steve Verdon, are the only person who has said my letter is fundamentally wrong — and you have mistakenly accused me of being wrong in the past due to your PDS (Prosecution Derangement Syndrome), so I’m not just taking your word for it.

    When you say “Don’t send the letter Patterico, you are wrong on this one.” — are you saying that you believe the assertion of the LAT quoted in the post, that there is a 1 in 3 chance Puckett is innocent, is correct?

    No, you aren’t saying that. So I’m not wrong. The LAT is wrong.

    Come on, Steve Verdon. Is it correct to say there is a 1 in 3 chance Puckett is innocent? Give us a clear yes or no on that. I predict there is no way you will give us a clear yes; rather, you’ll try to answer a different question than the one I asked.

    Is it correct to say there is a 1 in 3 chance Puckett is innocent, Steve Verdon?

    Patterico (4bda0b)

  79. If so, then the chances he’s innocent are the same as they would be if you assumed the database was composed purely of innocent people. That’d be kinda weird.

    Patterico (4bda0b)

  80. Once we work through the calculations with any likely data base coverage we get probabilities of guilt that are certainly probable cause but by themselves are certainly not proof beyond reasonable doubt.

    Uhhhmmm no. We can or might get probabilities that are probable cause. For example, in the Puckett case for a prior of 1/338,000 the probability of guilt is extremely low. In general, one could continue the investigation to see if additional evidence turns up, but based on just the DNA evidence and a non-informative prior probability we have little reason to think Puckett was the perp.

    As for the rest of your post, it is interesting and may be correct for a single-hit situation. However, I’m not too sure it works in the general case where we generalize to “n hits”.

    Unfortunately I have to head to work, so I’ll try to comment further.

    Oh, but before I go,

    Is it correct to say there is a 1 in 3 chance Puckett is innocent, Steve Verdon?

    I do believe that this is incorrect. I think the true number is closer to 1 in 4. I believe Karl Lembke pointed this out a day or two ago.

    Steve Verdon (284bc5)

    “I do believe that this is incorrect. I think the true number is closer to 1 in 4. I believe Karl Lembke pointed this out a day or two ago.”

    As long as you continue to maintain that the likelihood the database contains the guilty person is irrelevant — even though this factor is obviously central to the analysis — I don’t really care about your view of the statistics. If you start from such a clearly flawed premise, anything you say can only distract from the search for the truth.

    Patterico (e32025)

  82. Is it correct to say there is a 1 in 3 chance Puckett is innocent, Steve Verdon?

    No one will ever mistake me for Steve Verdon, but I’ll answer anyway. Without knowing the probability that the perp was in the database, or what other corroborating evidence may have been found, it’s impossible to tell. If, however, we assume that (1) there is a 50-50 probability that any given killer will be in the database, and (2) the DNA quasi-match is the only evidence linking Puckett to the crime, then I can say without reservation that there is a roughly one in three chance that Puckett is innocent. Here’s why:

    Probability of true match: 1/2
    Probability of no true match: 1/2
    Probability of false hit: 1/3
    Probability of no false hits: 2/3

    [Assume for simplicity that (1) false hit always yields exactly one, and (2) killer in database always results in a successful match. Neither assumption is true, but both should be OK for a rough illustration.]

    Probability of guilt (1 true match + 0 false hits) = 1/2 x 2/3 = 2/6 = 1/3.

    Probability of innocence (0 true matches + 1 false hit) = 1/2 x 1/3 = 1/6.

    Therefore, the odds of guilt vs. innocence are 1/3 to 1/6, or 2 to 1. Which is another way of saying 1 in 3.

    Xrlq (b71926)

  83. I’ve been trying to follow the math, and my head aches (that happened in my stats classes, too.) I wish to thank the participants, because eventually they’re going to educate me.

    One of the things that’s bothered me (and it may have been covered and I’ve missed it) is the notion of “degraded” DNA. How does it degrade? Is the strand shortened or are some of the “letters” changed (and how do we know that this has or has not happened.)

    Another thing is the independence of the markers. How many people in the database (and in the world) have marker A with value a, how many have B with b, …? If most of those in the database have Aa, how useful is A in comparisons to distinguish between samples?

    Finally, those in the database have been convicted and found guilty, or plead guilty. They may not have actually done the deeds they are guilty of. I know, a tiny fraction.

    htom (412a17)

  84. Steve, here is my analysis in Bayesian terms:

    P(KIB|DNA=1) = P(KIB) * P(DNA=1|KIB) / P(DNA=1)

    P(KIB|DNA=1) is probability that the killer is in the DB, given exactly 1 DNA hit. (If we have exactly 1 hit, and the killer is in the DB, then we know that the killer is guilty.)

    P(KIB) is the probability that the killer is in the DB. Let’s call this “P.”

    P(DNA=1) is the probability that a DB search will return exactly 1 hit.

    That’s going to be equal to (P * .7354 + .2260 * (1-P))
    (because we are adding together two exclusive probabilities: the chance that the killer is in the DB and there are 0 false positives, and the chance that the killer is not in the DB and there is exactly 1 false positive)

    Finally, P(DNA=1|KIB) = .7354
    If we know the killer is in the DB, there is a 73.54% chance of having him and 0 false positives

    The equation is thus:

    P(KIB|DNA=1) = (.7354 * P) / (.7354 * P + .2260 * (1-P))

    Which is the equation I’ve been using all along. It’s a proper Bayesian inference.

    For example, in the Puckett case for a prior of 1/338,000 the probability of guilt is extremely low.

    First, I think that may be generous to the prosecution. You should not assume that the killer is among the SO population. I used prior odds of 1:16M and still came to the conclusion that Mr. Puckett was, beyond a reasonable doubt, guilty.

    Second, you don’t need to know the prior odds of guilt in order to use the equation I gave above in order to evaluate the value of a “cold hit.” You just need to know the prior odds that a killer would be in the DB, and you are updating that probability. When exactly one match is returned, the probability that the killer is in the DB is the probability that the match is the killer.

    (ignoring human error and accepting the prosecution’s numbers as to 1-in-1.1M)
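
    (The same inference as a short Python function. The 0.7354 and 0.2260 figures are just the Poisson probabilities of zero and one false positives for a 338,000-entry database with a 1-in-1.1 million random match probability, and the priors plugged in at the end are only examples.)

    import math

    lam = 338_000 / 1.1e6       # expected number of false positives
    P0 = math.exp(-lam)         # ~0.7354, P(0 false positives)
    P1 = lam * math.exp(-lam)   # ~0.2260, P(exactly 1 false positive)

    def posterior_guilt(prior_in_db):
        # P(the single hit is the killer | exactly one hit), given P(killer in DB)
        return P0 * prior_in_db / (P0 * prior_in_db + P1 * (1 - prior_in_db))

    print(posterior_guilt(0.40))   # ~0.68, the figure from comment 47
    print(posterior_guilt(0.50))   # ~0.76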

    Daryl Herbert (4ecd4c)

  85. Patterico:

    One problem with your proposed letter is in the language.

    the likelihood of a false positive, based on the assumption that the true donor of the incriminating profile is not in the database.

    The likelihood of a false positive is the same whether the killer is in the DB or not. (Actually, it’s infinitesimally smaller when the killer is in the DB, because then you’re looking for a false positive among 337,999 people rather than 338,000.)

    What you want is, given a single positive result, the chance that it is a false positive. (I know you know this, but your letter isn’t clear.)

    If I’m right, I think The Times needs to correct this misimpression. What’s more, I think any correction should be very prominent, given the extreme prominence of the error (or what I believe to be an error) on the front page of the paper’s Sunday edition.

    No. This is arcane math stuff. It’s not getting a prominent correction.

    The correction should be something like this:

    “The chance of looking forward to a false positive is not the same thing as looking backwards, from a single matching result, and deciding the chance that the match is a false positive. The first statistic depends on the number of people in the database and the likelihood of a random person being a match. The second statistic cannot be calculated without knowing the likelihood that the true culprit is in the database.

    The makeup of the database is key. If the database had been of randomly-selected Californians, rather than sex offenders, the probability that a single matching result was a sign of guilt is only 6%, much lower than would be suggested by the bad math we used in our prior article. In the Puckett case, the odds that a single matching result from the database would be a false positive are probably in the neighborhood of 1-in-3, depending on the likelihood that DNA from a decades-old unsolved rape/murder would belong to one of the sex offenders in the database for some other sex crime.

    We throw ourselves on the mercy of our readership and offer as penance this article about Bayesian reasoning on the front page, above the fold somewhere in the back.”

    Your letter should get to the point very quickly, describing what went wrong and why, and save the math proofs for the end of the letter. (Put the proofs in footnotes or something.) Think of it as an “executive summary.”

    Better yet, get Prof. Kaye or Prof. Volokh to write the letter. The LAT is more likely to accept a math correction from a professor than from a prosecutor.

    Daryl Herbert (4ecd4c)

  86. 79

    “If so, then the chances he’s innocent are the same as they would be if you assumed the database was composed purely of innocent people. That’d be kinda weird.”

    No this is wrong. If we knew everyone in the database was innocent then we would know Puckett was innocent and we would know we had a false hit. See my 44 and related discussion.

    James B. Shearer (fc887e)

  87. If so, then the chances he’s innocent are the same as they would be if you assumed the database was composed purely of innocent people. That’d be kinda weird.

    Not really. We’re assuming the database is composed of people who are innocent of that particular crime. As the law presumes.

    Karl Lembke (ff486c)

  88. OK, here’s my Bayesian calculation.

    The probability that a person in the database is guilty (G) given a DNA match (M) is:

    P(G|M) = P(M|G) * P(G) / P(M)

    P(M|G) is the probability that we’ll get a DNA match if the person is guilty. Let’s assume there are no false negative results, so that if the person is guilty, there is a 100% chance of a DNA match.
    P(M|G) = 1.0

    P(G) and P(M) are the probabilities of a random person being guilty, and being a match to the DNA sample, respectively.

    Now, if our statistical universe is the population of the globe, then we have:
    P(G) = 1/6.7 billion
    P(M) = 1/1.1 million

    Putting the numbers in the right slots, we get:
    P(G|M) = 1.0 * 1/6.7 billion ÷ 1/1.1 million

    With a tiny bit of algebra, we get

    P(G|M) = 1.0 * 1.1 million / 6.7 billion
    = 1/6090.

    Remember, this assumes the statistical universe is the global population. If we restrict the universe, P(G|M) drops.

    If our universe is the US population of 300 million, then P(G|M) = 1.1 million / 300 million, or 1/272.

    If other considerations allow us to narrow the statistical universe to ten people, then P(G|M) becomes 10/1.1 million, or 110,000 to one.

    So the question becomes, what is your statistical universe, and how have you narrowed it down, if indeed, you have?
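
    (In Python, the two large-universe cases above look like this; the population figures are the same rough ones I used.)

    p_match = 1 / 1.1e6               # P(M), the random match probability

    for universe in (6.7e9, 300e6):   # world population, rough U.S. population
        p_g = 1 / universe            # P(G): one guilty person in the universe
        p_g_given_m = p_g / p_match   # P(G|M) = P(M|G)*P(G)/P(M), with P(M|G) = 1
        print("1 in", int(1 / p_g_given_m))   # 1 in 6090, then 1 in 272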

    Karl Lembke (ff486c)

  89. If other considerations allow us to narrow the statistical universe to ten people, then P(G|M) becomes 10/1.1 million, or 110,000 to one.

    That doesn’t seem right. If, through other means, police have narrowed down the suspect list to one man, then the probability that he’s guilty if his DNA matches would be 1 in 1.1 million. One would think that if his DNA matched, he’d be more likely to be guilty than if it didn’t match.

    Steverino (d6232c)

  90. Daryl,

    I disagree with some parts of your set up, but more importantly I think you need to re-run the numbers.

    P(KIB|DNA=1) = (.7354 * P) / (.7354 * P + .2260 * (1-P))

    Letting P = 6.25E-08 I get the following answer,

    P(KIB|DNA=1) = 2.03374E-07.

    In other words, there is an infinitesimally small probability that the killer is in the database. How you can conclude this is beyond a reasonable doubt is…well, beyond me.

    Steve Verdon (4c0bd6)

  91. Steverino:
    You know, you’re right. I was heading off to lunch, and flipped a term. (Careless of me, I know.)

    P(G|M) = P(G)/P(M)
    P(G) would be 1/10, P(M) is 1/1.1 million
    The result would be 110,000, meaning that guilt is not just certain, it’s damn certain.
    (Somewhere in there, there’ll have to be a normalizing term to make the probabilities add up to 1.0, but I’m not going to look for it right now.)

    (Oh, all right. Maybe I will…)
    In terms of set diagrams, what we’ve done is set the statistical universe smaller than the number of possible matches. Since Bayes’ Theorem assumes that {G} and {M} are both subsets of {U}, the universal set, when we shrink {U} to the point where {M} is extending outside it, the math blows up in our faces.

    Karl Lembke (ff486c)

  92. Remember, this assumes the statistical universe is the global population. If we restrict the universe, P(G|M) drops.

    If our universe is the US population of 300 million, then P(G|M) = 1.1 million / 300 million, or 1/272.

    If other considerations allow us to narrow the statistical universe to ten people, then P(G|M) becomes 10/1.1 million, or 110,000 to one.

    So the question becomes, what is your statistical universe, and how have you narrowed it down, if indeed, you have?

    I quite agree. A lot hinges on the prior probability of guilt here.

    Although I do wonder,

    P(M) = 1/1.1 million

    Why, when your earlier analysis showed that the probability of one or more matches in the database is 0.2646? Keep in mind the relevant question is “What is the probability of guilt given that we have gotten one (or more) hits off of the database.” Not, “What is the probability of guilt given a hit off of just one trial.”

    Steve Verdon (94c667)

  93. To be clear the 1/1,100,000 is the probability if you take one person, test them against the DNA sample and get a hit. And then stop. You test nobody else.

    This is NOT what the police did. They, in effect, did 338,000 tests and then looked over all the answers. That is, they didn’t stop when they got a match; they kept right on going, right down to the bottom of the list. A very different proposition. Hence I think using the P(M) = 1/1.1 million is overly low.

    Steve Verdon (94c667)

  94. Although I do wonder,

    P(M) = 1/1.1 million

    Why, when your earlier analysis showed that the probability of one or more matches in the database is 0.2646?

    Different questions.

    What is the probability that any one person, selected at random, will match the DNA sample?
    P(M) = 1/1.1 million

    What is the probability that at least one person in a randomly selected group of 338,000 people will match the DNA sample?
    26%.

    Keep in mind the relevant question is “What is the probability of guilt given that we have gotten one (or more) hits off of the database.” Not, “What is the probability of guilt given a hit off of just one trial.”

    Then we need to be very careful that we ask precisely that question, which means being very careful how we set up the statistical calculations.

    Karl Lembke (ff486c)

  95. Karl,

    You should go to lunch….and me too.

    Steve Verdon (94c667)

  96. You know, I could have lots of fun with the way you phrased “the relevant question”.

    “What is the probability of guilt given that we have gotten one (or more) hits off of the database.”

    Well, I’d say the probability of guilt is 100%. Someone has to be guilty of the crime, no matter how many hits we get off the database.

    If we get more than one hit, the probability that all of them are guilty is probably very close to 0%. Maybe they were accomplices, but there seems to be no indication there was more than one perpetrator.

    I’d say the relevant question is, indeed, P(G|M): “What is the probability that a person is guilty, given that his DNA matches the sample?”

    And a side issue is, was the prosecution justified in throwing that one-in-1.1 million figure in front of a jury?

    Karl Lembke (ff486c)

  97. Steve,

    I grabbed some lunch. But I think I will go back to the going-away potluck for my co-worker and be sociable.

    Later!

    Karl Lembke (ff486c)

  98. Letting P = 6.25E-08 I get the followign answer,

    P(KIB|DNA=1) = 2.03374E-07.

    Steve, P is the preexisting probability that the killer is in the sex offender database.

    If you think the DNA database was drawn from California men at random, and that the true culprit is one of those California men, then P would be .02

    You started with a .00000625% chance that the killer would be anywhere in the DB. (That’s not a per person chance, P represents the chance that the killer is anywhere in the DB.)

    For example, if you thought there were 50/50 odds the killer would be in the DB, P would be .5.

    If you thought there was only a 10% chance the killer would be in the DB, it’s .1.

    Even if you thought the DB was comprised of 338k people taken at random from around the world, the probability that the killer is in the DB is .005% (338k/6BN)

    The formula worked the way it’s supposed to. You put in a ridiculously low value–garbage in, garbage out.

    Daryl Herbert (4ecd4c)

  99. And a side issue is, was the prosecution justified in throwing that one-in-1.1 million figure in front of a jury?

    Yes. You must compare its probative value to the possibility of prejudice.

    Probative value: if you start from the premise that any man in California could have committed the crime, the odds of guilt look like 1:16M. Once you factor in that the odds of a match are 1:1.1M, that means that the odds of guilt are more like 1:16. That’s extremely probative.

    Prejudicial Possibility: the “prosecutor’s fallacy” could be used to convince jurors there is only a 1-in-1.1M chance he is innocent. This is wrong and bad and any judge who allows a prosecutor to get away with saying such a thing in this day and age should be tarred and feathered. But that just means we need smarter judges, not that we should do away with DNA evidence.

    Perhaps it would be better if the prosecution was only allowed to tell the jury that, based on this DNA, it is unlikely that more than 30 men who lived in California at the time had those particular 5.5 markers. That would be a better way to introduce the DNA evidence, without risking the prosecutor’s fallacy.

    Daryl Herbert (4ecd4c)

  100. Xrlq @ 82: very, very helpful. That will help me draft a shorter e-mail.

    James B. Shearer @ 86: you are right. I didn’t think that one through well.

    Patterico (4b2d82)

  101. Daryl,

    You wrote,

    I used prior odds of 1:16M and still came to the conclusion that Mr. Puckett was, beyond a reasonable doubt, guilty.

    I took that to mean 1 in 16 million. Now you want it to be 0.02, that is there is a 1 in 5 chance the killer was in there?

    The formula worked the way it’s supposed to. You put in a ridiculously low value–garbage in, garbage out.

    No, I just went with what you wrote. I’ve gone back and looked at your post, #84, twice now and can’t find the 0.02 or your reasoning for why you want to use it. Did you, perchance, divide 338,000 by 16 million?

    Oh, and using 0.02 as the prior and your calculations, I get a probability that the killer is in the database of 0.0656. Better, but I wouldn’t call that beyond a reasonable doubt either. Maybe you want a prior of 0.9, which then gives a result of 0.967. Is that beyond a reasonable doubt? I dunno.

    Yes. You must compare its probative value to the possibility of prejudice.

    Assuming the context put forward in the article is correct, no. The reason is, if you go out, grab one person at random, test them, and get a success, then that probability is 1/1.1 million. But the state did not do that. They went out and, in effect, tested 338,000 people, none of whom they thought was any more likely than another in the database to be guilty. Not providing a complete picture is misleading.

    Steve Verdon (94c667)

  102. Correction:

    that is there is a 1 in 5 chance…

    That should read:

    that is there is a 1 in 20 chance….

    Steve Verdon (94c667)

  103. I took that to mean 1 in 16 million. Now you want it to be 0.02, that is there is a 1 in 5 chance the killer was in there?

    Oh golly.

    I never said P was 1:16M.

    The probability that Mr. Puckett is guilty, before we do DNA testing or look at any other evidence, let’s call that G. That could be 1:16M.

    The probability that the true killer is in the database, before we do DNA testing–let’s call that P. That is NOT going to be 1:16M.

    If we assume the killer has just as much chance of being in the DB as any other California man, then P = .02. That is to say, 2%, or one in fifty, not one in five. That figure is derived by dividing 338k/16M. 338k men in DB / 16M men in CA = 2% of men in California are in the DB.

    If we assume that, because the killer is probably a sex fiend, he has a greater chance of ending up in the DB for having been caught for other sex crimes, we can use a higher value for P. How high? I don’t know. There’s no easy/obvious way to decide how likely we think it is that the perpetrator of a decades-old rape/murder will be in the DB as a result of being caught for other crimes.

    AFTER we know that the database returned exactly one hit, AND that that hit was Mr. Puckett–AFTER that event takes place–then we can safely say that G is true if and only if P is true. Don’t confuse those two variables.

    Oh, and using 0.02 as the prior and your calculations, I get a probability that he killer is in the database of 0.0656. Better, but I wouldn’t call that beyond a reasonable doubt either.

    First, re-read my post #14. That explains my numbers. In fact, it even contains the 6.6% figure.

    How can a 6.6% chance of guilt be guilty beyond a RD? The simple answer is: it’s not. Not by itself.

    But when you combine it with other evidence in the case it is damning.

    The 6% figure may seem small, but before we did the DNA testing, we thought the probability of guilt was 1:16M. The 6% figure represents a leap of about 1M:1 in terms of probative value.

    Those are 1:15 odds that Mr. Puckett is guilty.

    ASSUMING that the DNA database is entirely randomly composed (which is a bad assumption, IMO, that undercounts the strength of the prosecution’s evidence)

    P(G|DNA=1) == 6.6%

    But there are 2 other bits of evidence: the distinctive MO and the fact that he was in the vicinity at the time.

    Let’s call those:
    P(VIC) (probability that any California man would be in the very general vicinity around the time of the murder) = .05

    That is to say, a 5% chance any Californian would have been in/around SF that night. That’s probably high. About 2.5% of California’s male pop currently lives in SF (I realize I’m using today’s statistics rather than 1972s, but I’m lazy).

    P(VIC|G) = 1.0.

    That is to say, the probability that a man would be in the general vicinity, if he was guilty, must be 100%. He couldn’t have raped/murdered her if he wasn’t there.

    P(G) = we go into this using our prior probability of guilt, the 1:15 odds.

    P(G|VIC) = this is what we want to know

    P(G|VIC) = 1.0 * 1:15 / .05 = 20:15 = 4:3

    So now we think it’s slightly more likely that he’s guilty than not.

    How about the fact that he has a distinctive MO? Let’s say only 5% of sex offenders have that MO. Let’s say there’s a 50% chance the victim was killed by someone with that MO, and a 50% chance that the rape/murder just happened to go down in the manner it did even though the real killer didn’t have that particular MO.

    P(G) = 4:3
    P(MO) = .05
    P(MO|G) = .5
    P(G|MO) = what we are solving for, the probability of guilt updated for taking into account the MO

    P(G|MO) = .5 * 4:3 / .05 = 40:3 = approximately 13:1

    That undercounts the chance of guilt, because it’s based on the idea that the sex offender DB is drawn completely at random from California’s population, so we haven’t factored in that Mr. Puckett is a sex offender who was previously caught by the police. That’s a huge, huge thing missing from the analysis.

    If we had previously assumed higher odds that the true culprit would be in the DNA database, based on the fact that it’s a DB of sex offenders, we could not do this next step, because it would be double-counting. We would be counting Mr. Puckett’s sex offender status against him twice, which is completely improper math-wise.

    P(SO) = chance that any random Californian is a SO caught by the police = .02
    P(SO|G) = chance that the guilty man is a sex offender known to police = this is identical to the variable “P” discussed above.
    P(G|SO) = this is what we want to know.
    P(G) = the prior probability of guilt was 13:1 odds in favor.

    P(G|SO) = P * 13:1 / .02 = 650 * P : 1

    Note that 0 < P < 1, so the odds will come out lower than 650:1.

    Even if you think there is only a 10% chance the true killer is an SO known to the police (and therefore, in the DNA database), that still gives odds of guilt at 65:1. For a lot of people, that will satisfy “beyond a reasonable doubt.” I think it’s probably higher than that.

    I do think Mr. Puckett is guilty beyond a reasonable doubt. I’ve drawn all inferences in his favor and still come up with damning numbers.
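
    Here is a quick Python sketch for anyone who wants to check that chain of numbers. It simply replays the odds updates above with the same illustrative inputs (the .05, .5, and .02 values are assumptions, not data), and, like the updates above, it divides by P(E) rather than P(E|~G):

        # Replay of the odds-form updates above, using the same illustrative numbers.
        prior_odds = 1 / 15                     # the ~1:15 odds of guilt after the DNA hit

        def update(odds, p_e_given_g, p_e):
            # The update used above: multiply the odds by P(E|G)/P(E).
            # (Textbook Bayes in odds form divides by P(E|~G) instead.)
            return odds * p_e_given_g / p_e

        odds = update(prior_odds, 1.0, 0.05)    # vicinity evidence
        odds = update(odds, 0.5, 0.05)          # distinctive MO
        P = 0.10                                # assumed chance the killer is in the DB
        odds = update(odds, P, 0.02)            # known-offender / DNA-collection step

        print(f"odds of guilt roughly {odds:.0f}:1")
        # Prints about 67:1 with P = 0.10; the 65:1 above is the same figure after
        # rounding 40:3 down to 13:1 along the way.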

    Daryl Herbert (4ecd4c)

  104. 76

    “If a match comes up in a database of known offenders and you work your way through a mountain of leads…and each lead continues to point toward that individual…making a stronger and stronger case…why would it matter that you hit on the DNA markers first, second, eighth, twentieth or 12758th in your order of leads?”

    Practically, it does make a difference. People are not completely objective; they interpret their observations of the world according to a preexisting framework. For example, if they are Republicans they are more likely to believe stories of Democratic wrongdoing than of Republican wrongdoing (and of course vice versa if they are Democrats). Here, if you start with a DNA hit, it is likely to bias the following investigation, especially if the investigators give the hit excessive weight. The investigators may focus more on evidence pointing toward guilt than toward innocence. For example, if the MO is similar to previous attacks by the suspect but the physical appearance of the victim is different, they may note only the similarity in MO. Or if they find the suspect lived 14 miles from the scene of the crime, they may decide that less than 15 miles is near, whereas if the suspect lived 4 miles from the crime scene then perhaps less than 5 miles would become the definition of near. It is very hard to avoid this sort of bias, and it makes circumstantial evidence gathered after the hit somewhat unreliable. On the other hand, if the DNA test comes last, it should still be completely objective; the results should not depend on how certain the investigators are that the suspect is guilty. Even here care should be taken: you don’t want the lab to know what the expected result is.

    James B. Shearer (fc887e)

  105. I think in terms of guilt or innocence, Daryl has it about right, though I haven’t read through the entire post.

    Let’s go back to the Times piece. It states:

    Jurors were not told, however, the statistic that leading scientists consider the most significant: the probability that the database search had hit upon an innocent person.

    Note: Not “the probability that the person found in the database search was innocent”. Rather, it was the probability (to phrase it better than the reporter did) that a search of a database of that size would score a hit on an innocent person.

    Jurors are often told that the odds of a coincidental match are hundreds of thousands of times more remote than they actually are, according…

    That’s what I’ve been looking at. That’s what I took Patterico’s doubts about “the formulation” as referring to.

    In cases where the odds against a match to one person are quadrillions to one, and there’s only one chance in a million of anyone else on the planet matching a sample, I have no problem with calling a DNA match a “lock”. But when the odds against a match are small compared with the world population, or the national, state, or county population, it’s not a lock, and I, for one, would object to anyone tossing around numbers that make it sound like it is one.

    And at this point, I think we’re going over old ground. I may look in later, but I’m not sure if I’ll find anything else to comment on.

    Karl Lembke (ff486c)

  106. Daryl,
    You are assuming the murder was committed by someone resident in SF at the time. While this is more likely than not, we still cannot rule out the possibility that it was done by someone only passing through.

    The distinctive MO is also likely to be overweighted by confirmation bias.

    But the main mistake is that you are using the probability of him being a sex offender caught by police as well as his chance of being in the database. This is double counting and inflates your calculated probabilities. The information you should be using is how much being a sex offender increases his chance of being in the database. If you use realistic chances of the suspect being in the database, we still get substantial probabilities of innocence.

    Lloyd Flack (ddd1ac)

  107. The probability that we have the right person is P(G | DNA = 1) = P(KIB) x P(FP = 0) / (P(KIB) x P(FP = 0) + (1 – P(KIB)) x P(FP = 1)). This does not increase linearly with P(KIB), the probability that the killer is in the database. What you are doing will lead to an overestimate of the probability of guilt as we increase the probability that the database includes the killer.
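
    A quick numerical sketch of that formula, using the article’s 1-in-1.1-million match probability and 338,000-profile database, and treating every non-killer profile as an independent chance of a coincidental match (an idealization):

        # P(G | exactly one hit) as a function of P(KIB), the prior probability that
        # the killer is in the database.  Idealized: every non-killer profile is
        # treated as matching independently with probability p.
        p = 1 / 1.1e6       # random match probability (from the article)
        N = 338_000         # database size (from the article)

        def p_guilt_given_one_hit(p_kib):
            p_fp0 = (1 - p) ** (N - 1)           # killer in DB: no false positives among the rest
            p_fp1 = N * p * (1 - p) ** (N - 1)   # killer absent: exactly one false positive
            return p_kib * p_fp0 / (p_kib * p_fp0 + (1 - p_kib) * p_fp1)

        for p_kib in (0.02, 0.1, 0.3, 0.5, 0.9):
            print(p_kib, round(p_guilt_given_one_hit(p_kib), 3))
        # 0.02 -> ~0.06, 0.1 -> ~0.27, 0.5 -> ~0.77: doubling P(KIB) does not double
        # the probability of guilt, so the relationship is clearly not linear.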

    Lloyd Flack (0c6a49)

  108. If you want to correct the LA Times article, correct their arithmetic.

    “In every cold hit case, the panels advised, police and prosecutors should multiply the Random Match Probability (1 in 1.1 million in Puckett’s case) by the number of profiles in the database (338,000). That’s the same as dividing 1.1 million by 338,000.”

    1 divided by 1.1 million times 338,000 is not the same as dividing 1.1 million by 338,000!

    Skeptic (9a4a22)

  109. Heh. That’s a very good point. How did I miss that?

    Patterico (4bda0b)

  110. No it’s not. It’s a totally crappy non-point, and you were right to miss it on the first go. The odds of a single test yielding a false positive are 1 in 1.1 million. The odds of 338,000 tests collectively yielding a false positive are (1 x 338,000) in 1.1 million. Which is indeed the same as (1 x 338,000 / 338,000 = 1) in (1.1 million / 338,000).

    Xrlq (62cad4)

  111. X,

    Maybe I’m missing something.

    Shouldn’t they have said “That’s the same as dividing 338,000 by 1.1 million” and not the other way around?

    After all, they end up with a roughly 1 in 3 chance.

    Patterico (4bda0b)

  112. Put another way, you say:

    Which is indeed the same as (1 x 338,000 / 338,000 = 1) in (1.1 million / 338,000).

    But 1 in (1.1 million/338,000) is the same as 1 divided by (1.1 million/338,000) — which I think still ends up being 338,000 divided by 1.1 million. And not the other way around.

    Correct me if I’m wrong.
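
    A quick check of the division, using the article’s numbers:

        rmp = 1 / 1_100_000      # random match probability from the article
        n = 338_000              # profiles in the database

        print(rmp * n)           # 0.307..., roughly a 1 in 3 chance
        print(n / 1_100_000)     # 0.307... again: 338,000 divided by 1.1 million
        print(1_100_000 / n)     # about 3.25, the "1 in 3.25" denominator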

    Patterico (4bda0b)

  113. #104. James B. Shearer:

    Practically it does make a difference.

    Not really. For a criminalist, DNA is a means to identification, nothing more. It’s a tool used to exclude people from consideration. A five loci match cannot exclude as many people as a 13 loci match…which is why criminalists continue to use other methods to refine their identification, and why I find the discussion rather specious to begin with.

    As I mentioned a few days ago, a criminalist with a sample DNA fragment is going to run a database query just as he/she would if they had a surveillance photo of a suspect. Or a partial fingerprint. If, and only if, there is a possible match, will it be further investigated, and only at that time does the reliability and the confidence of the possible identification come into question.

    EW1(SG) (84e813)

  114. No, strictly speaking you’re right. However, I think that’s the point the author was trying to make in the first place, albeit perhaps a bit less artfully. I read Skeptic’s comment as suggesting something else.

    Xrlq (62cad4)

  115. Daryl,

    You have a problem with your calculations.

    You state that P(V) = 0.05; however, that is too low given the other probabilities you use. This follows from the theorem of total probability, which in this case gives

    P(V) = P(G)P(V|G) + P(~G)P(V|~G).

    However, you have also asserted that P(V|G) = 1 and that P(G) = 0.066. Hence you are asserting that

    P(~G)P(V|~G) = -0.016, which clearly cannot happen since both P(~G) and P(V|~G) lie in the closed interval [0,1].

    This highlights a bigger problem with Daryl’s posts. These numbers can’t just be pulled out of one’s ear. They have to make sense. Eliciting prior probabilities in Bayesian analysis is not as trivial an exercise as Daryl makes it out to be.
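
    A quick check of that constraint, plugging in the numbers as stated (a sketch only):

        # Total probability check on the numbers as stated above.
        P_V   = 0.05     # asserted P(V)
        P_G   = 0.066    # the ~6.6% posterior probability of guilt
        P_V_G = 1.0      # asserted P(V|G)

        # P(V) = P(V|G)*P(G) + P(V|~G)*P(~G), so solve for P(V|~G):
        P_V_notG = (P_V - P_V_G * P_G) / (1 - P_G)
        print(P_V_notG)    # about -0.017: negative, so the inputs are inconsistent

        # The largest P(V|G) compatible with P(V) = 0.05 and P(G) = 0.066:
        print(P_V / P_G)   # about 0.76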

    No it’s not. It’s a totally crappy non-point, and you were right to miss it on the first go. The odds of a single test yielding a false positive are 1 in 1.1 million. The odds of 338,000 tests collectively yielding a false positive are (1 x 338,000) in 1.1 million. Which is indeed the same as (1 x 338,000 / 338,000 = 1) in (1.1 million / 338,000).

    I think this is not quite correct, and this is also a problem in the original article. The chance of one or more hits is close to 1 in 4. However, that is merely the chance of a hit; it says nothing about the guilt of the person whose name pops out. For that you need to use Bayes’ theorem.

    I urge everyone still following this discussion to go here and read the article. It is about how to determine if a woman has breast cancer given that a test came back positive for breast cancer. Here is the nut of the problem,

    1% of women at age forty who participate in routine screening have breast cancer. 80% of women with breast cancer will get positive mammographies. 9.6% of women without breast cancer will also get positive mammographies. A woman in this age group had a positive mammography in a routine screening. What is the probability that she actually has breast cancer?

    The problem is very similar to the one we are dealing with here, except that all the probabilities are known. One might be tempted to look at such a test result and say, “Crap, I (or my wife) have breast cancer.” However, that would be wrong if you are looking at the 80% number. The actual probability that the woman with a positive test has breast cancer is 7.8%! Ten times lower than what most people think is the actual answer.
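
    For anyone who wants to verify that figure, the arithmetic is short:

        # Bayes' theorem applied to the quoted screening problem.
        p_cancer        = 0.01    # prevalence among forty-year-old women screened
        p_pos_cancer    = 0.80    # P(positive | cancer)
        p_pos_no_cancer = 0.096   # P(positive | no cancer)

        p_pos = p_pos_cancer * p_cancer + p_pos_no_cancer * (1 - p_cancer)
        print(p_pos_cancer * p_cancer / p_pos)   # about 0.078, i.e. 7.8%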

    Probabilistic reasoning is often counterintuitive. Look at the Monty Hall problem. People are still arguing about that one, and there are mathematical proofs out there, as well as computer simulations, that show “switching” is always a better strategy than “sticking”.

    Simply because the answer is counterintuitive or doesn’t strike you as right doesn’t mean it is wrong. There are other paradoxes in statistics too: Simpson’s Paradox, the Allais Paradox, the Ellsberg Paradox. And then there is the work of Daniel Kahneman on how people routinely flub probabilistic reasoning and decision making under uncertainty and behave “irrationally,” usually by violating Bayes’ theorem or some other theorem in probability theory.

    Steve Verdon (284bc5)

  116. Thank you for the link, Steve; that’s a fantastic explanation. (7.763+ to two digits, 7.8 🙂 )

    htom (412a17)

  117. I apologize for not responding sooner, but I was not permitted to do so. (No, I was not grounded by my parents–I don’t want to compromise my pseudonym.)

    —-

    If P(V) means the probability that any California man would be in the vicinity, then you get:

    P(V) = .05

    P(V|G) = 1.0
    P(G) = 1/16M

    P(V|NG) = .04999994
    P(NG) = (16M – 1) / 16M

    The numbers add up. Your mistake is using the probability of 1 man’s guilt when I’m talking about the probability that any California man would be in the vicinity. I was most certainly not talking about the “probability” that Mr. Puckett was in the vicinity, which I assumed to be 1.0.

    —-

    I was using P(V) to mean the probability that Mr. Puckett was there. It was presumptuous of me to assume a 100% chance, despite the fact that he seems to have admitted to being there and the police have strong evidence placing him there. There’s always a chance everyone is wrong.

    He apparently admitted that he was there, which he probably would not have done if it wasn’t true. Let’s say there’s a 1/20 chance, if he was innocent, that he was wrong about being there (despite the other evidence showing he was there).

    Let’s drop P(V) to .95. We still end up with 618 * P : 1 odds of guilt, which are beyond a reasonable doubt, IMO.

    P is the likelihood that the person who raped and murdered Diana Sylvester was later arrested for any felony (not just a sex crime, because I made a mistake as to the information contained in the DB) and had his DNA collected as a result. That suggests P should be higher than previously thought. A P of .5 seems reasonable. Someone who would commit a rape/murder seems like the kind of screw-up who would end up with a felony record, even if the later felonies weren’t sex crimes. Even a P of .3 would leave only about a half-percent chance that Mr. Puckett is innocent.

    My numbers add up. He is guilty beyond a reasonable doubt, to my subjective and personal understanding of what “beyond a reasonable doubt” means.
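
    Here is the arithmetic behind those figures, in the same rough style as before (the 0.95 and the values tried for P are assumptions, not data):

        # The earlier chain gave odds of about 650*P : 1; dropping P(V) from 1.0
        # to 0.95 just scales that by 0.95.  All inputs here are assumptions.
        base = 650 * 0.95              # about 618

        for P in (0.1, 0.3, 0.5):      # assumed chance the killer's DNA was collected
            odds = base * P
            prob_innocent = 1 / (odds + 1)
            print(P, round(odds), round(prob_innocent, 4))
        # P = 0.3 gives odds near 185:1, i.e. roughly a half-percent chance of innocence.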

    Daryl Herbert (452002)

  118. The numbers add up.

    No, they don’t. Your numbers, taken in conjunction, violate the theorem of total probability. Thus either P(V) is too low or P(G) is too high. There is no room for doubt about this. Your numbers just don’t add up, and as such your claims of guilt beyond a reasonable doubt are now in doubt.

    Your mistake is using the probability of 1 man’s guilt when I’m talking about the probability that any California man would be in the vicinity.

    I made no mistake. I merely applied the theorem of total probability to your setup. To be specific, the theorem of total probability says that, for an event B and its complement ~B,

    P(A) = P(A|B)*P(B) + P(A|~B)*P(~B).

    Now you want to change your supposedly inviolate numbers because you’ve been caught in a mistake.

    Using your numbers,

    0.05 = x*0.066 + P(V|~G)*P(~G)

    Now for this equality to hold it must be the case that 0.05 – x*0.066 > 0. That in turn implies that P(V|~G)*P(~G) > 0.7576. And for this to be the case you have to have the following probability, P(V|~G) > 0.8111.

    I’m sorry, your made-up-out-of-whole-cloth priors stink. I think I’d rather leave them to an expert, which is not you.

    Oh, and regarding this,

    P is the likelihood….

    You do know the difference between probability and likelihood, don’t you?

    Steve Verdon (94c667)

  119. Correction to the following:

    That in turns implies that P(V|~G)*P(~G) > 0.7576. And for this to be the case you have to have the following probability, P(V|~G) > 0.8111.

    Should read, P(V|G) < 0.7576 and for this to be the case it must be that P(V|~G) < 0.06.

    I was using P(V) to mean the probability that Mr. Puckett was there.

    Not in your initial calculations. In your initial calculations you used 0.05. Now you want to backpedal and try to claim you used 1? Please, your calculation was,

    P(G|VIC) = 1.0 * 1:15 / .05 = 20:15 = 4:3

    And your 1:15 is wrong too. It should be 1:14.

    And the correct calculation is,

    Odds(G|V) = Odds(G) * [Pr(V|G)/Pr(V|~G)].

    Why you have 0.05 in the denominator in your initial calculation is not clear to me.

    Steve Verdon (94c667)

