Claim-Check Experimentation

#1
Here are some doubts regarding the claim-check experimentation that came to mind while using the prototype:

1) For the claim-check tests, I understand we will pick a fact-check article, and based on that, each of us will write our own claim-check article. I was thinking about this, and I guess it would be important to collect and analyze data on the articles themselves, as well as on the argument maps each of us creates. I'm still not exactly sure how this should be done, or what exactly should be measured, but the thing is that someone could arrive at (let's say) a very similar claim confidence level while having done a rather different analysis (i.e. argument map) behind the scenes. I don't know if this is too obvious and something you (@Jack Harich) were already expecting, or something you don't consider necessary, but in my opinion, producing similar argument maps is part of producing an accurate measurement. What do you think?

2) It is very likely that at the beginning you (@Jack Harich) will be an outlier, regardless of the protocol we're using, simply because of your experience and because, as you've said, you already think very much in terms of Structured Argument Analysis. It would probably make sense for all of us to practice the process of claim-checking an article from scratch together with you before starting the actual experimentation, to level the starting point somewhat and avoid big differences due to very basic mistakes. Is this something you were already considering? If not, do you think it's necessary?
 

Jack Harich

#2
Wow, some super insightful questions here! This is going to be a fantastic experiment.

1. "I guess it would be important collecting and analyzing data on the articles that we wrote themselves, as well as the argument maps that each of us created. I'm still not exactly sure about how this should be done, or what exactly should be measured." - I"m not sure either, but this is normal in pioneering, complex experimentation. As long as we have clear goals in terms of the hypotheses we want to test, improvement cycles will lead us to that goal.

For the first experiment, each cycle consists of picking a claim in a fact-check article, then each of us creating a claim-check article for that claim, then meeting to compare results, especially our claim truth confidence levels. The goal is to get variation down to a low asymptote, perhaps a 15 percentage point range. In our discussions we will discover what factors cause excessive variation. The biggest, I suspect, is that we each followed different procedures. That can be greatly reduced by improving the protocols so that we all follow standard procedures.
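
To make that variation target concrete, here's a minimal sketch (Python) of how the per-cycle spread of confidence levels could be tracked. The checker names, the numbers, and the 15-point target wiring are made-up assumptions for illustration, not part of any agreed protocol.

```python
TARGET_RANGE = 15  # percentage points; the low asymptote we hope variation approaches (assumed target)

# Hypothetical claim truth confidence levels (0-100%) from three claim-checkers per cycle
cycles = {
    "cycle_01": {"checker_A": 72, "checker_B": 55, "checker_C": 80},
    "cycle_02": {"checker_A": 68, "checker_B": 61, "checker_C": 74},
}

for name, levels in cycles.items():
    values = list(levels.values())
    spread = max(values) - min(values)  # widest disagreement in this cycle
    verdict = "within target" if spread <= TARGET_RANGE else "excessive variation"
    print(f"{name}: spread = {spread} points ({verdict})")
```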

From our meeting notes of June 17 we had this hypothesis: The truth confidence level of a claim can be accurately measured by use of the right tool, such as Structured Argument Analysis. That this hypothesis may be false is a huge project risk, as discussed on the project page under Risk Management. Read that closely. These concepts will, I suspect, drive the experiment and your paper. Of course these concepts will evolve. I'd guess that our experiment design will not be stable until we've done several or even many (more than 10) cycles because there's so much discovery going on. This is normal.

1. "producing similar argument maps is part of producing an accurate measurement" - Gee, had not thought about that. There are many roads to Rome, so it may be that claims can be checked by different arguments. But I tend to agree with you. Different arguments, expecially radically different ones, arriving at about the same confidence level are suspect. The agreement could accidental. Again, this is something that we will work out in the discovery phase of the experiment.


2. "It is very likely that at the beginning you (@Jack Harich) will be an outlier, regardless of the protocol we're using, simply because of your experience and because, as you've said, you already think very much in terms of Structured Argument Analysis. It would probably make sense for all of us to practice the process of claim-checking an article from scratch together with you before starting the actual experimentation, to level the starting point somewhat and avoid big differences due to very basic mistakes. Is this something you were already considering? If not, do you think it's necessary?"

Hadn't thought of it past the obvious need for training and practice before we get serious about experiment cycles. But even then I may still be an outlier. Or I may not, because I will probably be way too focused on and distracted by software development, which requires huge amounts of time.

We won't know how much time we each need to practice to get comfortable with the tool until we try it. "Getting comfortable with the tool" then is a project milestone.

Do these answers help?
 
#3
Thank you Jack.

1. I'll have to make sure that we really have clear hypotheses, and that we don't confuse testing for intercoder reliability with testing the reliability of the tool itself (Structured Argument Analysis). We need to test for them separately, but both are necessary to ensure that the truth confidence level of a claim can be accurately measured by use of the right tool. (A rough sketch of that distinction follows after this list.)

2. I do believe we'll have to develop the protocols on the one hand, and a standardized training for claim-checking on the other. With practice we'll get better, but when we get to the point of introducing the tool to others, I guess we'll need practical training to prepare them to use it.
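
On point 1, here's a rough sketch (Python) of how the two checks could be kept separate. Every name and number is a made-up assumption, and a real analysis would probably use an established metric such as ICC or Krippendorff's alpha instead of the simple spread and error figures used here.

```python
from statistics import mean, pstdev

# Hypothetical confidence levels (0-100%) produced with Structured Argument Analysis
checker_levels = {
    "claim_1": [70, 75, 68],  # three checkers, same claim
    "claim_2": [40, 55, 45],
}

# Hypothetical reference levels, e.g. from an expert panel or an agreed benchmark
reference = {"claim_1": 72, "claim_2": 50}

# (a) Intercoder reliability: do checkers using the same tool agree with each other?
for claim, levels in checker_levels.items():
    print(f"{claim}: std dev across checkers = {pstdev(levels):.1f} points")

# (b) Reliability of the tool itself: do its results track the reference values?
errors = [abs(mean(levels) - reference[claim]) for claim, levels in checker_levels.items()]
print(f"mean absolute error vs reference = {mean(errors):.1f} points")
```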
 

Jack Harich

#4
Excellent. I really like your sharp, structured thwinking here. And I can see that you, the research/experimentation specialist, are starting to thwink deeper than I have, in a productive direction. That's a really good sign of a well-rounded, smoothly working team.

Go team!
 

Jack Harich

#5
Below are the questionnaires I did last year. The purpose was to test my assumptions that claim-checks (CC) and Politician Truth Ratings (PTR) would be effective and that PTR would have more impact than fact-checks (FC). It's easy to base one's work on false assumptions, and I didn't want to fall into that trap.

The attachments are Excel spreadsheet files. Each file contains the results of running an online questionnaire using Prolific for the panel of participants. The questionnaire software used was Simfatic. The file numbers range from 01 to 06. A total of 6 questionnaire experiments were done. I made an error on experiment 4 and tossed out its results.

The questionnaires contain an unusual number of open-ended questions. These are a gold mine for insights into why people chose their answers to the multiple choice questions. These files are fascinating and highly educational reading, at least for those interested in this line of research. I suspect that because the subjects know they will have to take the time to explain their multiple choice answers in detail in the open-ended questions, the answers themselves are likely to be of better quality, i.e. closer to what the subjects would do in a similar real-world situation.

Here are the questionnaires for:
- Experiment 01 (Test on four friends)
- Experiment 02 (5 participants. FC vs PTR, Candle. Questionnaires 01 and 02 are identical.)
- Experiment 03 (10 participants. FC vs PTR, Candle. Moved FC image to earlier question.)
- Experiment 05 (10 participants. FC vs PTR. Added features list of FC vs PTR.)
- Experiment 06 (5 participants. FC vs PTR. Removed features list, added summary and new graphic.)

The first file contains the results of experiments 01 and 02. Experiment 01 (rows 3 to 6) was for testing on people we knew in order to see if the questionnaire had any problems. It didn't. The lower section of the spreadsheet (rows 8 to 12, 5 participants) contains the results for experiment 02. The outstanding result was a strong preference for PTR over FC and that people would really like to use Candle. But 5 is a small sample.

The second file contains the results of experiment 03. Sample size was increased from 5 to 10. The astonishing result was no significant preference for PTR over FC. Why was this? Reading the open-ended question answers, I decided the probable reason was that I had failed to describe the differences between FC and PTR clearly. So I added a features list to the next experiment, as shown below:

[Image: features list comparing FC and PTR]

The third file contains the results of experiment 05. The results were as expected: now there was a strong preference for PTR over FC. The choices were 0, 0, 1.5, 1.5, 1.5, 2, 2, 2, 2, 2, where 0 is not sure, 1 is prefer FC, 1.5 is about the same, and 2 is prefer PTR. No one preferred FC, 2 were not sure, 3 said about the same, and 5 preferred PTR.
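
As a quick illustration, here's a tiny Python tally of those coded choices; the coding scheme is the one described above, and the script only shows how the counts fall out.

```python
from collections import Counter

# Coding scheme from above: 0 = not sure, 1 = prefer FC, 1.5 = about the same, 2 = prefer PTR
choices = [0, 0, 1.5, 1.5, 1.5, 2, 2, 2, 2, 2]
labels = {0: "not sure", 1: "prefer FC", 1.5: "about the same", 2: "prefer PTR"}

for code, count in sorted(Counter(choices).items()):
    print(f"{labels[code]}: {count}")
# -> not sure: 2, about the same: 3, prefer PTR: 5 (no one preferred FC)
```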

But perhaps people simply preferred PTR because it had more features than FC, i.e. "I'm smart. I should prefer the one with the most features." So two months later, after creating the "issues graphic for PTR", I changed the questionnaire to drop the features list. Instead, it had a summary of FC vs PTR and the new graphic. These are the same two PTR images seen on the Truth Ratings System Ratings page. Here's the summary:

A. While fact-checks are selected on the basis of what would make a good story, claims for Politician Truth Ratings are selected randomly. This avoids bias and allows a scientifically rigorous measure of a politician's tendency to tell the truth.

B. Fact-checks provide an isolated measure of the truth. Politician Truth Ratings provide a complete measure, one that tells the whole story instead of just part of it.

C. Informed citizens need to know where politicians stand on certain issues and how truthful they are on those issues. To support this, Politician Truth Ratings organizes the claims used in a rating into the top issues, as shown below.

[Image: PTR claims organized into top issues]

The fourth file contains the results of experiment 06. The results were positive. They showed that people strongly preferred PTR over FC, based on a summary of the differences, a graphic of what two FCs look like, and a graphic of what two PTRs look like. 1 person preferred FC, while 4 preferred PTR. The open-ended question answers showed these choices were due to a correct understanding (in their own words, which is high validation) of what I feel are the advantages of PTR over FC.

My final conclusion was that it was safe to proceed with the assumption that voters would be positively influenced by PTR, if we could build a software system that would make that possible. That became my next large task.

Another conclusion was that properly worded open-ended questions can provide extremely insightful results. Earlier testing revealed that just asking "Please explain your answer to so-and-so?" led to low quality answers, like "Because it was obvious." It was not until I changed the format to "Please write a thoughtful paragraph explaining why you chose so-and-so. This is extremely valuable information for helping us design better quality of information for voters. Thanks." that I started to get high quality answers to the open-ended questions.
 

#6
Jack, thank you for sharing your findings from the questionnaires! It was really interesting seeing the results you got!
When I first read them I took a couple of notes that I hadn't shared. Here they are:
  1. Maybe it would have been interesting to ask for the respondent's gender.
  2. Political orientation is often asked for with a scale of the political spectrum on which respondents place themselves. That can prove more accurate.
  3. In the question about satisfaction with the quality of information available, the word "somewhat" may be better than "slightly" for the response options. Some type of horizontal scale would be another option for answering this question.
  4. It was a good idea to start with the fact-checks before moving on to the PTRs. Another idea would have been comparing fact-checking to PTR at the beginning, and then asking the questions. That has the benefit that it makes a 1:1 fair comparison, but has the risk of people forgetting or confusing details.
  5. The tone of the questionnaire shouldn't take a side. Phrasing the explanation of PTRs as "We have developed a new form of information", or describing what it does as "measuring the quality of something so you don’t have to do all that work yourself" could be considered as taking a side. That can bias the responses in favor of PTRs.
  6. In general, the explanations of what fact-checking is and what PTRs are should be the minimum required for the respondent to know what they are. Usually, once explanations surpass a certain length, it's just too much mental effort and people start skipping parts. I noticed the explanation of PTRs was shorter in the later questionnaires, great! I think the best one was the 5th one. I understand you took that comparison table out because it was misleading, but the format (a different background color, a different arrangement of the information) made the whole explanation easier to take in.
  7. Even after the explanation of PTR was shortened, it was still much longer than the explanation of fact-checks. That's to be expected, because PTRs were a new concept for everybody, but to keep everything fair, an explanation more balanced in length would be ideal.
  8. The fact that some answers were quoting the previously given explanation (eg. "I can't explain it better than the reasons you gave") shows that it did make some people lean towards a preference for PTRs.
  9. Q10 of experiment 01 was too long and confusing. Less is more.
  10. Instead of asking which one (FC or PTR) would influence them the most, one could have asked the respondents to rate how much each one would influence them.
  11. The question about how people personally go about deciding who to vote for had some insightful answers as an open question. In experiment 03, though, I noticed there were different types of answers, which means people understand different things under the same question, or it is simply too broad. That's not a problem for a small n; for a larger one, another option would be to ask respondents to rate how important different (given) aspects are for them in deciding who to vote for.
  12. Q12 of experiment 01 (Candle influence) asks the respondent to identify the "amount" of influence Candle would have on them. The word "amount" is far too imprecise and ambiguous and should be avoided. For one person a "medium amount" could be equivalent to another's "large amount", or so. Q7 is similar; more specific answer options could help with accuracy.
  13. Asking for a paragraph is really clever, almost everybody followed the instructions and gave substantial answers.
Good! That's all I noted down! Again, it was really interesting diving into the first reactions to PTRs among the public! I think "truth" is indeed a recurring aspect brought up by the respondents, and PTRs have the potential to have a positive impact on the democratic system at large.
 

Jack Harich

#7
I had a suggestion for what we used to call experiment 1 and now call study 1.

It's a study because it's not an experiment. It collects data related to a long succession of claim-checks done by 2 or more claim-checkers analyzing the same claim.

The idea is that it's possible to turn this into an experiment if we have two or possibly three groups:
  1. The first group uses Structured Argument Analysis to determine a claim's truth confidence level (CL).
  2. The second group uses the techniques fact-checkers use now: intuition combined with procedures for doing good fact-checking, such as those described in the Chicago Guide to Fact-checking.
  3. A possibility is to have a third group of people using no method at all, just their own personal intuition. This is a type of control group.
This gives us group treatments. Random assignment to groups is possible. However, the level of training is zero in group 3, medium in group 2, and high in group 1.

The advantage of an experiment versus a study is that the experiment can test the hypothesis that claim-checking produces CLs with better accuracy and precision than fact-checking, which in turn does better than no formal method at all.
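
To illustrate what that comparison might look like, here's a minimal Python sketch. The group labels, the confidence levels (CLs), and the reference CLs (e.g. from some expert consensus panel) are all hypothetical assumptions, not real data or an agreed design.

```python
from statistics import mean, pstdev

# Hypothetical CLs (0-100%) each group assigned to the same four claims
groups = {
    "group_1_SAA":           [72, 48, 85, 60],
    "group_2_fact_checking": [65, 55, 70, 75],
    "group_3_intuition":     [90, 30, 60, 95],
}

# Hypothetical reference CLs for those same claims
reference = [70, 50, 80, 65]

for name, levels in groups.items():
    errors = [abs(cl - ref) for cl, ref in zip(levels, reference)]
    # Accuracy: how close to the reference on average. Precision: how consistent the errors are.
    print(f"{name}: mean abs error = {mean(errors):.1f}, error std dev = {pstdev(errors):.1f}")
```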

However, the experiment is much more complex and time-consuming to run. It could be that we need to do study 1, which is very hard in itself, and then consider doing the experiment.