Spreadsheet Screw-ups (or bad choice of Excel for serious purposes)

I’m going to guess a formula error in this one

It’s back to the drawing board for the plan to reduce speed limits to 30kph across Wellington, after one of the city council’s own discovered a serious error in the council’s cost-benefit analysis.

The mistake – first spotted by councillor Tony Randle – meant the crash-reduction benefits of lowering the speed limit were overstated by more than $250 million.

The Wellington City Council confirmed there was a significant error in the papers out for public consultation since May 24. Consultation would be halted and submitters would be contacted with apologies as the plan went back to councillors in August, it said.

The mistake was “a small but significant error”, Randle said. He had experience as an analyst and discovered the error after asking council staff for the spreadsheet behind the cost-benefit analysis.


The mistake came about because council staff took a figure that had already been adjusted for the rate of under-reported crashes, and then applied the under-reporting adjustment again.

This meant the safety benefits of the decision to reduce speeds across Wellington were overstated by hundreds of millions of dollars.
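To see why applying the same adjustment twice blows the number up, here's a toy calculation. All figures below are hypothetical stand-ins; the article doesn't give the council's actual inputs or adjustment factor:

```python
# Toy illustration of double-counting an under-reporting adjustment.
# Every number here is hypothetical, not the council's actual figure.

reported_crash_savings = 100_000_000   # benefit based on *reported* crashes ($)
under_reporting_factor = 2.5           # assume only 40% of crashes get reported

# Correct: apply the adjustment once.
adjusted_once = reported_crash_savings * under_reporting_factor    # $250M

# The error described: the input figure already included the adjustment,
# and staff applied it a second time.
adjusted_twice = adjusted_once * under_reporting_factor            # $625M

print(f"overstatement: ${adjusted_twice - adjusted_once:,.0f}")    # $375,000,000
```

With a factor of 2.5, the second application inflates the benefit by 150% of the correct figure, which is how an error like this can reach hundreds of millions of dollars.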

If someone can run faster than I’m allowed to drive then the speed limit is too low. jmo.

2 Likes

I wonder if Usain Bolt has any speeding tickets?

1 Like

No, but he was probably pulled over for some reason. Taillights not existing, not staying in his lane, etc.

A 30 km/h / 20 mph speed limit probably isn’t that unreasonable in a very congested, pedestrian-heavy space.

Whether the streets the proposal was targeting in Wellington (it sounds like 80% of them) qualify… I haven’t been to Wellington, so I can’t say.

You make a good point.

I have friends in Wellington - it’s a lovely city. However, be careful after dark -

Wellington Paranormal is also fun… (a spin-off)

back on topic – if you’re going to falsify the data in your academic study, don’t do it in Excel

did you know it keeps track of the alterations you made?

calcChain is so useful here because it will tell you whether a cell (or row) containing a formula has been moved, and where it has been moved to. That means we can use calcChain to go back and see what this spreadsheet may have looked like back in 2010, before it was tampered with!

Because column C has formulas, calcChain needs to record the order in which to solve them. Importantly, it preserves the order in which those formulas were initially entered into the spreadsheet. It will indicate to first solve C2, then C3, and so on. Critically, when a cell is moved, its place in the calculation order is not updated. That means that in the example above, Excel continues to compute row 67 right after it computes row 57, and right before it computes row 77, no matter where you move that cell to*.
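For anyone who wants to poke at this themselves: calcChain is just an XML part inside the .xlsx zip archive, so the standard library is enough to dump it. A minimal sketch (the filename `suspect.xlsx` is made up):

```python
# Minimal sketch: dump the calculation chain from an .xlsx file.
# An .xlsx is a zip archive; the chain lives in xl/calcChain.xml.
# "suspect.xlsx" is a hypothetical filename.
import zipfile
import xml.etree.ElementTree as ET

NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"

with zipfile.ZipFile("suspect.xlsx") as xlsx:
    root = ET.fromstring(xlsx.read("xl/calcChain.xml"))

# Each <c> element's r attribute is a cell reference like "C67";
# the order of the elements is the order the formulas were entered.
for position, cell in enumerate(root.iter(f"{NS}c"), start=1):
    print(position, cell.get("r"))

# A cell whose position in this chain disagrees badly with its current
# row number is a candidate for having been moved after it was entered.
```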

Analyses of calcChain for the 5 other out-of-sequence observations similarly support the hypothesis that an analyst (manually) moved observations from one condition to the other. Click the links below to see them.

1. Row 68

2. Rows 67 & 72

3. Row 69

4. Row 71

5. Rows look orderly elsewhere

Part 2:

In this case, it was the content, more than any spreadsheet feature, that indicated something was afoot:

The Anomaly: Strange Demographic Responses

As mentioned above, students in this study were asked to report their demographics. Here is a screenshot of the posted original materials, indicating exactly what they were asked and how:

We retrieved the data from the OSF (OSF | The Moral Virtue of Authenticity), where Gino (or someone using her credentials) posted it in 2015. The anomaly in this dataset involves how some students answered Question #6: “Year in School.”

The screenshot below shows a portion of the dataset. In the “yearSchool” column, you can see that students approached this “Year in School” question in a number of different ways. For example, a junior might have written “junior”, or “2016” or “class of 2016” or “3” (to signify that they are in their third year). All of these responses are reasonable.

A less reasonable response is “Harvard”, an incorrect answer to the question. It is difficult to imagine many students independently making this highly idiosyncratic mistake. Nevertheless, the data file indicates that 20 students did so. Moreover, and adding to the peculiarity, those students’ responses are all within 35 rows (450 through 484) of each other in the posted dataset:

So, it seems somebody did a copy/paste falsification of the data to make the results come out the way they wanted… and they were super-lazy about it.
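If you loaded the posted file yourself, flagging that clustering only takes a few lines. A rough sketch, assuming pandas and the `yearSchool` column name from the screenshot (the filename is hypothetical):

```python
# Rough sketch: locate and measure the cluster of "Harvard" answers.
# Assumes the posted data loads into a DataFrame with the yearSchool
# column from the screenshot; the filename is hypothetical.
import pandas as pd

df = pd.read_excel("gino_study1.xlsx")

answers = df["yearSchool"].astype(str).str.strip().str.lower()
suspect = df.index[answers == "harvard"]

print(f"{len(suspect)} 'Harvard' responses")
print(f"spanning rows {suspect.min()} through {suspect.max()}")
# 20 identical, idiosyncratic answers packed into ~35 consecutive rows
# is hard to square with independent respondent behaviour.
```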

Every “normal” observation is represented as a blue dot, whereas the 20 “Harvard” observations are represented as red X’s:

Indeed, we should note that while we found evidence of data tampering in all four studies covered in this series, we do not believe (in the least) that we’ve identified all of the tampering that happened within these studies. Without access to the original (un-tampered) data files – files we believe Harvard had access to – we can only identify instances where the data tamperer slipped up, forgetting to re-sort here, making a copy-paste error there. There is no reason (at all) to expect that a data tamperer who makes a mistake when changing one thing in a dataset makes the same mistake when changing everything else in it.

3 Likes

I love the fact that her book is sub-titled thusly:

3 Likes

She must not have had the chapter - “How to hide your tracks when tampering with data”

2 Likes

part 3 is here:

We believe that these observations are out of sequence because their values were manually altered (after they were sorted) to produce the desired effect [4].
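A crude way to surface rows like that yourself, assuming the sheet was originally sorted on some ID column (the column and file names here are hypothetical):

```python
# Crude sketch: find rows that break an otherwise-sorted ID column.
# Column and file names are hypothetical.
import pandas as pd

df = pd.read_excel("study_data.xlsx")
ids = df["participant_id"]

# A row is out of sequence if its ID is smaller than the previous row's.
out_of_sequence = df[ids < ids.shift(1)]
print(out_of_sequence)
# In a file sorted once and never edited afterwards this prints nothing;
# every hit is a row worth a closer look.
```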

I did my podcast before they posted part 3:

1 Like

You should have released the parts out of sequence.

1 Like

And now Francesca Gino is suing the Data Colada guys & Harvard.

Given at least one of their results, where it is pretty damn clear the data were tampered with and Harvard almost definitely has the original, untampered data… this seems very foolish on her part.

But hey. Maybe she’s trying to see if she can negotiate her way into a settlement w/ Harvard. I doubt the Data Colada guys will settle.

If Gino has tenure w/ Harvard, then maybe she does have an action, even if she did commit fraud.

Summary

THE NATURE OF THIS ACTION

  1. Plaintiff Francesca Gino (“Plaintiff” or “Professor Gino”) is employed by Harvard University as a tenured Professor at the Harvard Business School.
  2. Plaintiff is an internationally renowned behavioral scientist, author, and teacher. She has written over 140 academic articles, both as an author and as a co-author, exploring the psychology of people’s decision-making.
  3. Plaintiff has never falsified or fabricated data.
  4. In July 2021, a trio of professors and behavioral scientists (all male), Defendant Uri Simonsohn, Defendant Leif Nelson, and Defendant Joseph Simmons, who have a blog named “Data Colada,” (and who are collectively referred to herein as “Data Colada”), approached Harvard Business School with alleged concerns about perceived anomalies and “fraud” in the data of four studies in academic articles authored by Plaintiff.
  5. Data Colada threatened to post the “fraud” allegations on their blog, thereby subjecting Plaintiff, and by extension, Harvard Business School, to public scrutiny.
  6. Without Plaintiff’s knowledge, Harvard University and the Dean of Harvard Business School, Defendant Srikant Datar (“Dean Datar”), negotiated an agreement with Data Colada pursuant to which Harvard Business School investigated the allegations, in accordance with a new employment policy created solely for Plaintiff, in exchange for Data Colada’s silence during the investigation period. Unbeknownst to Plaintiff, Harvard Business School further agreed to disclose the outcome of the investigation to Data Colada, who could then subject Plaintiff’s work and professional reputation to public disparagement on its blog.
  7. Pursuant to its negotiations with Data Colada, in August 2021, Harvard Business School created the “Interim Policy and Procedures for Responding to Allegations of Research Misconduct” (“Interim Policy”) just for Plaintiff, which included a range of potential sanctions, including termination of employment.
  8. Under said Interim Policy, a finding of research misconduct required an investigation committee to prove, by a preponderance of the evidence, that Plaintiff “intentionally, knowingly, or recklessly” falsified or fabricated data, and to specify for each allegation the requisite intent.
  9. Under said Interim Policy, as with any other policy at Harvard, allegations were required to be made in good faith, and an investigation was required to be fair. Neither of those things happened in this case.

TL;DR the whole thing. One argument she’s using is that it wasn’t her who analyzed the data, so it’s unfair to say she committed fraud; it was the analyst. I don’t know how that works legally. I mean, it’s her name on the manuscript, so presumably she has some onus to ensure the analysis was conducted properly. If multiple manuscripts have serious data errors (which appears to be the case), was it always the same analyst(s)? If she was working with different people, it seems unlikely that in each case the analyst was fudging the data.

Yeah I searched for “assistant” in the document (as in research assistant)… it’s pretty clear that data was tampered with in many cases.

Maybe she didn’t know, but here’s a question: what happened to RAs who didn’t get interesting results?

2 Likes
- This statement was demonstrably and verifiably false. The data provided to HBS by Professor Gino’s former research assistant was in an Excel spreadsheet and was not the “original dataset” for Study 1. The “original dataset” for Study 1 had been collected on paper in 2010, a fact that was clearly documented in the original 2012 PNAS Paper.
- Data Colada also knew that the so-called “duplicate observation” was not evidence of tampering, as it was equally likely that the same index card was used for participants’ IDs, or that the research assistant who conducted the study entered the ID twice – an honest error.
- Importantly, Data Colada also acknowledged in this recent blog post that it knows that an author of a study is not always the person responsible for handling or collecting the data and, therefore, is also not always the person responsible for any resulting “data irregularities.” See id. Doubtless, as behavioral scientists at leading universities, Data Colada knew or had reason to know that Professor Gino works with research assistants and others and may not have been the person responsible for any perceived anomalies in studies she authored.

(https://twitter.com/danengber/status/1686851180771557376)

and more:

The Hartford, an insurance company that collaborated with Ariely on one implicated study, told NPR this week in a statement that it could confirm the data it had provided for that study had been altered after it was given to Ariely, but prior to the research’s publication: “It is clear the data was manipulated inappropriately and supplemented by synthesized or fabricated data.”

Ariely denies that he was responsible for the falsified data. “Getting the data file was the extent of my involvement with the data,” he told NPR.