Spreadsheet Screw-ups (or bad choice of Excel for serious purposes)

Part 2:

In this case, it was the content, rather than any spreadsheet feature itself, that indicated something was afoot:

The Anomaly: Strange Demographic Responses

As mentioned above, students in this study were asked to report their demographics. Here is a screenshot of the posted original materials, indicating exactly what they were asked and how:

We retrieved the data from the OSF (OSF | The Moral Virtue of Authenticity), where Gino (or someone using her credentials) posted it in 2015. The anomaly in this dataset involves how some students answered Question #6: “Year in School.”

The screenshot below shows a portion of the dataset. In the “yearSchool” column, you can see that students approached this “Year in School” question in a number of different ways. For example, a junior might have written “junior”, or “2016” or “class of 2016” or “3” (to signify that they are in their third year). All of these responses are reasonable.

A less reasonable response is “Harvard”, an incorrect answer to the question. It is difficult to imagine many students independently making this highly idiosyncratic mistake. Nevertheless, the data file indicates that 20 students did so. Moreover, and adding to the peculiarity, those students’ responses are all within 35 rows (450 through 484) of each other in the posted dataset:
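If you want to check that yourself from the posted file, here is a minimal sketch (it assumes the OSF data has been exported to CSV; the file name is a placeholder, and “yearSchool” is the column name shown in the screenshot):

```python
import pandas as pd

# Minimal sketch: count the "Harvard" answers to the year-in-school question
# and see where they sit in the posted file. The CSV name is a placeholder.
df = pd.read_csv("authenticity_study.csv")

harvard_rows = df.index[
    df["yearSchool"].astype(str).str.strip().str.lower() == "harvard"
]
print(len(harvard_rows))                       # 20 such responses in the posted data
print(harvard_rows.min(), harvard_rows.max())  # all packed into a ~35-row stretch
```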

So, it seems somebody did a copy/paste falsification of the data to make the results what they wanted… and they were super-lazy in falsifying.

Every “normal” observation is represented as a blue dot, whereas the 20 “Harvard” observations are represented as red X’s:

Indeed, we should note that while for all four studies covered in this series we found evidence of data tampering, we do not believe (in the least) that we’ve identified all of the tampering that happened within these studies. Without access to the original (un-tampered) data files – files we believe Harvard had access to – we can only identify instances when the data tamperer slipped up, forgetting to re-sort here, making a copy-paste error there. There is no reason (at all) to expect that when a data tamperer makes a mistake when changing one thing in a database, that she makes the same mistake when changing all things in that database.

3 Likes

I love the fact that her book is sub-titled thusly:

3 Likes

She must not have had the chapter - “How to hide your tracks when tampering with data”

2 Likes

Part 3 is here:

We believe that these observations are out of sequence because their values were manually altered (after they were sorted) to produce the desired effect [4].
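The logic behind that kind of check is easy to sketch: if the file was sorted on some variable before the values were edited, an edited row shows up as a break in the sort order. This is just an illustration; the file name and the “score” column are made-up placeholders, not the study’s actual variables.

```python
import pandas as pd

# Rough sketch of an "out of sequence" check. The file name and the column the
# dataset was supposedly sorted on ("score") are placeholders for illustration.
df = pd.read_csv("study_data.csv")

col = df["score"]
# If the file was sorted ascending on this column before any edits were made,
# a value smaller than the running maximum of the rows above it breaks the order.
out_of_sequence = col < col.cummax()
print(df[out_of_sequence])
```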

I did my podcast before they posted part 3:

1 Like

You should have released the parts out of sequence.

1 Like

And now Francesca Gino is suing the Data Colada guys & Harvard.

Given at least one of their results, where it is pretty damn clear the data were tampered with and Harvard almost definitely has the original, untampered data… this seems very foolish on her part.

But hey. Maybe she’s trying to see if she can negotiate her way into a settlement w/ Harvard. I doubt the Data Colada guys will settle.

If Gino has tenure w/ Harvard, then maybe she does have an action, even if she did commit fraud.


THE NATURE OF THIS ACTION

  1. Plaintiff Francesca Gino (“Plaintiff” or “Professor Gino”) is employed by Harvard University as a tenured Professor at the Harvard Business School.
  2. Plaintiff is an internationally renowned behavioral scientist, author, and teacher. She has written over 140 academic articles, both as an author and as a co-author, exploring the psychology of people’s decision-making.
  3. Plaintiff has never falsified or fabricated data.
  4. In July 2021, a trio of professors and behavioral scientists (all male), Defendant Uri Simonsohn, Defendant Leif Nelson, and Defendant Joseph Simmons, who have a blog named “Data Colada,” (and who are collectively referred to herein as “Data Colada”), approached Harvard Business School with alleged concerns about perceived anomalies and “fraud” in the data of four studies in academic articles authored by Plaintiff.
  5. Data Colada threatened to post the “fraud” allegations on their blog, thereby subjecting Plaintiff, and by extension, Harvard Business School, to public scrutiny.
  6. Without Plaintiff’s knowledge, Harvard University and the Dean of Harvard Business School, Defendant Srikant Datar (“Dean Datar”), negotiated an agreement with Data Colada pursuant to which Harvard Business School investigated the allegations, in accordance with a new employment policy created solely for Plaintiff, in exchange for Data Colada’s silence during the investigation period. Unbeknownst to Plaintiff, Harvard Business School further agreed to disclose the outcome of the investigation to Data Colada, who could then subject Plaintiff’s work and professional reputation to public disparagement on its blog.
  7. Pursuant to its negotiations with Data Colada, in August 2021, Harvard Business School created the “Interim Policy and Procedures for Responding to Allegations of Research Misconduct” (“Interim Policy”) just for Plaintiff, which included a range of potential sanctions, including termination of employment.
  8. Under said Interim Policy, a finding of research misconduct required an investigation committee to prove, by a preponderance of the evidence, that Plaintiff “intentionally, knowingly, or recklessly” falsified or fabricated data, and to specify for each allegation the requisite intent.
  9. Under said Interim Policy, as with any other policy at Harvard, allegations were required to be made in good faith, and an investigation was required to be fair. Neither of those things happened in this case.

TL;DR the whole thing. One argument she’s using is that it wasn’t her who analyzed the data, so it’s unfair to say she committed fraud; it was the analyst. I don’t know how that works, legally. I mean, it’s her name on the manuscript, so presumably she has some onus to ensure that the analysis was conducted properly. If multiple manuscripts have serious data errors (which appears to be the case), was it always the same analyst(s)? If she was working with different people each time, it seems unlikely that the analyst was fudging the data in every case.

Yeah I searched for “assistant” in the document (as in research assistant)… it’s pretty clear that data was tampered with in many cases.

Maybe she didn’t know, but here’s a question: what happened to RAs who didn’t get interesting results?

2 Likes
More excerpts from the complaint:

This statement was demonstrably and verifiably false. The data provided to HBS by Professor Gino’s former research assistant was in an Excel spreadsheet and was not the “original dataset” for Study 1. The “original dataset” for Study 1 had been collected on paper in 2010, a fact that was clearly documented in the original 2012 PNAS Paper.

Data Colada also knew that the so-called “duplicate observation” was not evidence of tampering, as it was equally likely that the same index card was used for participants’ IDs or the research assistant who conducted the study entered the ID twice—an honest error.

Importantly, Data Colada also acknowledged in this recent blog post that it knows that an author of a study is not always the person responsible for handling or collecting the data and, therefore, is also not always the person responsible for any resulting “data irregularities.” See id. Doubtless, as behavioral scientists at leading universities, Data Colada knew or had reason to know that Professor Gino works with research assistants and others and may not have been the person responsible for any perceived anomalies in studies she authored.

(https://twitter.com/danengber/status/1686851180771557376)

and more:

The Hartford, an insurance company that collaborated with Ariely on one implicated study, told NPR this week in a statement that it could confirm that the data it had provided for that study had been altered after it was given to Ariely, but prior to the research’s publication: “It is clear the data was manipulated inappropriately and supplemented by synthesized or fabricated data.”

Ariely denies that he was responsible for the falsified data. “Getting the data file was the extent of my involvement with the data,” he told NPR.

It’s wild that they apparently used RANDBETWEEN(0,50000) to generate some of the samples for mileage driven.
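Real odometer readings are right-skewed and don’t stop dead at a round 50,000, which is why a uniform draw jumps out. Here’s a quick simulation of what a RANDBETWEEN(0,50000) column looks like (just an illustration, not the actual study data):

```python
import numpy as np

# Simulate what a RANDBETWEEN(0, 50000) column looks like -- NOT the study's
# data, just an illustration of the uniform-draw signature.
rng = np.random.default_rng(0)
fabricated = rng.integers(0, 50001, size=10_000)

counts, _ = np.histogram(fabricated, bins=10, range=(0, 50000))
print(counts)            # roughly equal counts in every bin: a flat histogram
print(fabricated.max())  # values pile up right against the 50,000 ceiling
```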

2 Likes

So you’re saying that miles driven is not uniformly distributed between 0 and 50,000? Who would have guessed?

1 Like

And hilarious that it apparently was some research assistant who didn’t even get their name on the paper(s) [maybe]

Maybe the RA was trying to see how obvious they could make their fraud before they got caught.

3 Likes

Literally nobody knows that distribution. Certainly nobody posting on this forum.

Sorry, I find this hard to believe. Weren’t there other papers that have data problems that are associated with this professor? Was the same RA involved in all of them too?

Okay, I’m going to give a preview of a podcast I’m planning on doing later in the week.

I was an RA for some physics profs, where I ran/adjusted the code (I think it was in C… but it could’ve been FORTRAN… I know lots of languages). As a result, I have official, peer-reviewed papers to my name. I did some adjustments to the code, made runs, and I made the one graph that appeared in the papers. And I got attribution. Those papers had 4 authors (including me).

Similarly, in math, if an RA contributed to a paper, they got attribution.

But I know that’s not the case in many fields. Especially social sciences.

Frankly, I could totally believe both Ariely and Gino don’t have enough technical know-how to tamper with the data well enough to get the results they needed. Hell, I wouldn’t be surprised if neither of them did any statistical analysis at all – that was the RA’s purview, after all. The RAs do the number-crunching, hand over the p-hacked results, and then the profs do the paper write-ups.

They may have directed an RA to make changes to get the results they wanted. I highly doubt they said anything that crass. They could have “encouraged” the RA [wink wink] to improve the results.

Or they could have rewarded the magical RA who got them the awesome results/analysis they had been looking for.

And for all I know there was more than one, but a magical RA may not have felt too generous about explaining how they got their magical results, because this is a “tournament” career, where there are few slots to go around. They would not get rewarded for helping fellow RAs.

1 Like

I suppose I am just a sweet summer child. I can barely wrap my head around that possibility.

https://twitter.com/owenboswarva/status/1688975359604101129
Catastrophic PSNI blunder identifies every serving police officer and civilian staff, prompting security nightmare https://belfasttelegraph.co.uk/news/northern-ireland/catastrophic-psni-blunder-identifies-every-serving-police-officer-and-civilian-staff-prompting-security-nightmare/a1823676448.html?=123… #FOI #dataprotection Spreadsheet disclosed via @WhatDoTheyKnow earlier today (now removed) https://whatdotheyknow.com/request/numbers_of_officers_and_staff_at

There are a tiny number of individuals whose unit is given as “secret”. But although that does not disclose precisely what they do, it marks them out as operating in an acutely sensitive area – and then gives their name.

There are details of the specialist firearms team, of riot police – the TSG unit – and the close protection unit which guards senior politicians and judges.

There is even a list of people responsible for “information security”.

It is a breathtaking exposure of PSNI secrets.

One former senior PSNI officer told the Belfast Telegraph that it was “astonishing” and a “huge operational security breach”.

“This is the biggest data breach I can recall in the PSNI,” he said.

This is not a new issue – I’ve used this example of gene names getting converted to dates in Excel in SOA presentations and articles going back to 2016, iirc.

Guest post: Genomics has a spreadsheet problem – Retraction Watch

Gene-name errors were discovered by Barry R. Zeeberg and his team at the National Institutes of Health in 2004. Their study first showed that when genomic data is handled in Excel, the program automatically corrects gene names to dates. What’s more, Riken clone identifiers – unique name tags given to pieces of DNA – can be misinterpreted as numbers with decimal points.

In 2016, we conducted a more extensive search, showing that in more than 3,500 articles published between 2005 and 2015, 20% of Excel gene lists contained errors in their supplementary files.


In response to these issues, the HUGO Gene Nomenclature Committee (HGNC) has made changes to some gene names prone to errors. This includes converting the SEPT1 gene to SEPTIN1 and the MARCH1 gene to MARCHF1.

The widespread adoption of these new names is a lengthy process: Our latest study, of more than 11,000 articles published between 2014 and 2020, found that 31% of supplementary files that included gene lists in Excel contained errors. This percentage is higher than in our previous 2016 study.
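A crude version of that kind of screen is easy to sketch. This is just an illustration, not the authors’ actual pipeline; the file name and the “gene” column are placeholders:

```python
import pandas as pd

# Screen a supplementary gene list for Excel damage. File name and column
# name are placeholders; reading .xlsx files requires openpyxl.
genes = pd.read_excel("supplementary_table.xlsx")["gene"].astype(str)

# Symbols like SEPT1 or MARCH1 come back as "1-Sep" / "1-Mar" after Excel's
# autocorrect; Riken clone IDs come back as numbers in scientific notation.
date_damage = genes.str.match(
    r"\d{1,2}-(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$", case=False
)
riken_damage = genes.str.match(r"\d+(?:\.\d+)?E[+-]?\d+$", case=False)

print(genes[date_damage | riken_damage])
```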

So, they did a study, and found about 20% had the problem.

They tried to fix the problem: publicized it, renamed some of the error-prone genes, etc.

Redid the study… 31% had the problem.

Yay.

2 Likes