So last time I looked into this, all of the solutions were pretty expensive. With the introduction of ChatGPT and similar AI tools offered (relatively) free to the public, I am wondering if there might be a new solution out there I can use, or whether I could potentially train ChatGPT to do this correctly.
Basically, I have a bunch of reports produced by multiple entities from which we manually pull various pieces of information. The reports produced by a given entity generally follow a basic set-up from one year to the next, but not always. So the needed information can move around. Think of it like trying to pull a bunch of data from multiple companies’ financial statements (not what I am trying to do but that is close enough for discussion purposes).
Is there a tool out there that I can teach to pull the information I need and feed it into a DB or a file that can be loaded into a DB? All of the reports I am planning to use are public information, so there is not a data privacy concern here. I would not have an issue feeding them into ChatGPT knowing OpenAI keeps all of the data entered.
I would worry about reliability. It seems like what you would really want is something that shows you, side by side, the value being entered into your database and where it is coming from in the report. I’d be interested in that too, but I don’t know of anything like that.
An example of the reliability issue: ChatGPT will confidently say that Russia has sent many bears into space, and provide references.
I read about one successful use of ChatGPT in which phone bank workers used it to provide technical support. It seems to be functioning more like a search engine, but people are naturally checking the result. That makes sense to me as something that might work well.
Your application is different because it cuts people out of the workflow entirely rather than helping them complete parts of it more quickly, with no easy opportunity to apply common sense to the answers being provided.
I know where the information is coming from in the report. And it isn’t of such volume that it can’t (won’t) be reviewed. But it would save a significant # of man hours to have it automatically entered by a computer and manually reviewed by a human. Right now it is hand entered by a human, which is prone to mistakes, and reviewed by a 2nd human, which doesn’t always catch the first human’s typos.
I’d much rather have a computer input “purple” than a human enter “237.2” when the answer is supposed to be 237.5. The first is easier to catch, and over time if we can teach the AI to understand why purple isn’t correct we can presumably improve initial accuracy.
The technology exists now to OCR the reports into text, and then use regular expressions to find certain numbers, if that is good enough. Or you could maybe try asking ChatGPT something like “tell me XY from this report: …”
Or you could try manually first and see if it works. If so then somebody might be able to build you something.
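To make the OCR-then-regex idea a bit more concrete, here is a minimal sketch in Python. It assumes the report has already been converted to plain text by whatever OCR tool you use. The field names and labels ("Total revenue", "Net income") are made up for illustration, so swap in the labels your reports actually use. Note that the pattern will happily skip over line breaks to reach the next number, which may or may not be what you want.

```python
import re

# A number such as 237.5 or 1,234.5 (no sign or currency handling).
NUMBER = r"(\d+(?:,\d{3})*(?:\.\d+)?)"

# Hypothetical labels; replace them with the ones in your own reports.
FIELD_PATTERNS = {
    "total_revenue": re.compile(r"Total\s+revenue\D*?" + NUMBER, re.IGNORECASE),
    "net_income": re.compile(r"Net\s+income\D*?" + NUMBER, re.IGNORECASE),
}


def extract_fields(text):
    """Return {field: float or None} for the first match of each label."""
    results = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match is None:
            # Leave a visible gap instead of guessing a value.
            results[name] = None
        else:
            results[name] = float(match.group(1).replace(",", ""))
    return results


sample = "Total Revenue ........ 1,234.5\nNet income: 237.5\n"
print(extract_fields(sample))
```

A field that comes back as None is the easy-to-spot kind of failure: the label moved or was renamed, and a human needs to look at that report.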
Try it out. Copy and paste the document to GPT, and then ask GPT to find the stuff you want from it.
ChatGPT is free. And GPT-4 is only $20/month. See how it does.
Usually, with that kind of project, I use a bunch of Excel, VBA, regex, etc. to locate the bits that I want. I also usually add some kind of internal validation: basically a list of functions that will break if something has been moved.
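A rough Python sketch of that kind of "break loudly" validation might look like the following. Everything in it (the `LayoutError` name, the `required` list, the 2x year-over-year threshold) is an assumption for illustration, not a standard; tune the checks to what your data should look like.

```python
class LayoutError(Exception):
    """Raised when an extracted record fails a sanity check."""


def validate_record(record, required, previous=None, max_ratio=2.0):
    """Return the record unchanged, or raise LayoutError listing every problem.

    record    -- dict of field name -> extracted value
    required  -- field names that must be present and numeric
    previous  -- optional dict with last year's values for a rough comparison
    max_ratio -- hypothetical threshold for a suspicious year-over-year jump
    """
    problems = []
    for name in required:
        value = record.get(name)
        if value is None:
            problems.append(f"{name}: missing")
        elif isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{name}: not numeric ({value!r})")
        elif previous and previous.get(name):
            ratio = abs(value) / abs(previous[name])
            if ratio > max_ratio or ratio < 1 / max_ratio:
                problems.append(f"{name}: {value} vs {previous[name]} last year")
    if problems:
        raise LayoutError("; ".join(problems))
    return record


# Passes: the value is numeric and close to last year's.
validate_record({"net_income": 237.5}, ["net_income"], previous={"net_income": 230.0})
```

This catches a "purple" or a missing value outright, and a value that is off by an order of magnitude if you supply last year's figures. It will not catch 237.2 entered for 237.5, so a human review step is still needed.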
You might actually want to ask GPT for help with that instead. Like ask it to make a script that pulls all the numbers as well as their immediate context.
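For what it's worth, a script like that does not need to be complicated. Here is a minimal sketch, assuming the report is already plain text. The function name and the 30-character context width are arbitrary choices, and the pattern ignores signs, currency symbols and parenthesized negatives.

```python
import re

# Matches 237.5, 2023 or 1,200 (thousands separators allowed).
NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")


def numbers_with_context(text, width=30):
    """List every number with its line number and the text around it."""
    rows = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in NUMBER.finditer(line):
            start = max(0, match.start() - width)
            end = min(len(line), match.end() + width)
            rows.append({
                "line": line_no,
                "value": match.group(0),
                "context": line[start:end].strip(),
            })
    return rows


report = "Net income was 237.5 in 2023, up from 1,200.\nNo numbers here."
for row in numbers_with_context(report):
    print(row["line"], row["value"], "|", row["context"])
```

Keeping the line number and surrounding text next to each value gives a reviewer something to check the number against, which is most of what a side-by-side view would need.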
There are applications already out there built over ChatGPT that allow you to extract information from PDF files. The ones that I tried did a pretty good job of locating the information I was seeking.
You still have to use the chat interface, however. There may be API access once these apps figure out their business models.