Apologies if my intro is a bit long-winded. Anyway, I graduated from college in 2000. I was aware of the actuarial profession, but I didn't yet know it would be my path. What I did know is that I should learn SQL. My gf at the time was a year behind me, so she bought me an academic SQL book from the university bookstore. That was my introduction to SQL. Then I moved on to community college courses. At the time, Oracle was considered the gold standard, so that's what I learned.
To make a long story short, my experience learning SQL is very different from that of my younger colleagues, who learned it on the job. That said, here are my 3 cardinal sins of SQL:
1. Unnecessary use of a LEFT JOIN. There must be a reason for a LEFT JOIN; otherwise, use an INNER JOIN.
2. Putting LEFT JOIN criteria in the WHERE clause. If you put LEFT JOIN criteria in the WHERE clause, you filter out the unmatched rows and effectively turn it back into an INNER JOIN. Put the criteria in the JOIN's ON clause (see the sketch after this list).
3. Don't use NOT IN. Just don't. To evaluate a NOT IN, the engine has to check every record. If the table in question has 500,000,000 records, that's gonna take a while.
All good imo
SQL queries and results depend mainly on the data and how it is constructed. Maybe you learn structure and the language of queries from a book, but one needs to know field names, whether the field names in one table are defined the same way as those in another table, etc. And often this is not an easy thing, since it requires asking someone who knows. And maybe that person is the one you just replaced. This is a lot easier when the IT Team understands the importance of documentation (and understands that others will be using their databases). That required a new philosophy of IT not being the only people allowed to query (when I first started and at other substandard places I worked, this was the way: you had to ask others to extract data for your work).
Oh, and WHICH SQL matters. I learned using MS-SQL, which is quite different from the Oracle versions I use now, mostly in how you manipulate dates.
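For example, shifting a date back one month looks different on each platform (both calls are real functions on their respective systems):

```sql
-- SQL Server (T-SQL)
SELECT DATEADD(month, -1, GETDATE());

-- Oracle
SELECT ADD_MONTHS(SYSDATE, -1) FROM dual;
```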
I think there are exceptions to all of these. If I want to include Life and Disability and Long Term Care and Annuity and Critical Illness, and exclude Medical and Dental and Vision, then saying that the coverage type is NOT IN the 3-item Medical/Dental/Vision list is fine to use.
There are other ways to accomplish the same thing, and I'd probably noodle around with what's fastest if I were going to run the program a lot. If the lion's share of claims are medical, then using NOT IN and starting the list with Medical might be fastest.
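Something like this, against a hypothetical claim table (the column name is made up):

```sql
-- A short literal list is a cheap per-row comparison; the engine
-- is checking three constants, not probing a big subquery result.
SELECT *
FROM claim
WHERE coverage_type NOT IN ('Medical', 'Dental', 'Vision');
```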
I would also add that one needs to know how much temporary memory/storage there is. A query-within-query might do the job, but it might run up against the memory maximum.
"I will be querying against 500,000,000 records and doing complex calculations and joins. But my business unit is only 5,000,000 of those records. Best for me to first cut down the size of the data that I will be running those complicated calculations and joins against."
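A minimal sketch of that filter-first approach, assuming a hypothetical business_unit column marks the 5,000,000 relevant records:

```sql
-- Cut the 500M-row table down to the business unit first,
-- then do the expensive joins/calculations on the small set.
WITH bu AS (
    SELECT policy_id, claim_amount
    FROM claim
    WHERE business_unit = 'LIFE'   -- hypothetical filter
)
SELECT b.policy_id,
       SUM(b.claim_amount) AS total_paid
FROM bu b
JOIN policy p ON p.policy_id = b.policy_id
GROUP BY b.policy_id;
```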
After a while, you might decide to make an automatically updated (once a month or so) table in some personal area, if you have access to one, so your queries run faster (leaving more time to post here!). If not, ask.
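If your shop is on Oracle, the one-time build can be a plain CREATE TABLE AS SELECT, with the refresh run through whatever monthly job scheduler you have (the schema, table, and filter here are all made up):

```sql
-- One-time build of a pre-filtered copy in your personal schema.
CREATE TABLE my_schema.bu_claims AS
SELECT *
FROM claim
WHERE business_unit = 'LIFE';

-- Monthly refresh: empty the copy and reload it.
TRUNCATE TABLE my_schema.bu_claims;

INSERT INTO my_schema.bu_claims
SELECT *
FROM claim
WHERE business_unit = 'LIFE';
```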
That works, too!
I’m more of a “Scottie” guy: “Captain, it could take days to complete this task!” (Gets it done in a few hours.) “Well, I just worked harder, because I care so much!”
I gave my IT group something that was essentially a join/aggregate process to create a file that was about 99% compressed after aggregation. What they wrote was taking forever to run… uh, how are you doing that? My SQL runs in a couple of minutes. Of course, they did the join first and the aggregate second… ugh. Come on, guys…
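Roughly the difference, with made-up table names (a modern optimizer can sometimes rewrite the first form itself, but apparently not always):

```sql
-- Slow way: join the big detail table to the lookup first,
-- then aggregate hundreds of millions of joined rows.
SELECT l.category, SUM(d.amount) AS total
FROM detail d
JOIN lookup l ON l.id = d.lookup_id
GROUP BY l.category;

-- Faster here: aggregate the detail table down ~99% first,
-- then join the small aggregate to the lookup.
SELECT l.category, SUM(a.total) AS total
FROM (
    SELECT lookup_id, SUM(amount) AS total
    FROM detail
    GROUP BY lookup_id
) a
JOIN lookup l ON l.id = a.lookup_id
GROUP BY l.category;
```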
SQL is pretty easy. The data platform you use matters too: see analytical vs. transactional databases. That alone won't keep you from doing bad things, though.
I'm doing similar things inside R, and can go on to do the rest of my analysis in the same environment.
Probably doesn't matter as much how I do it when I think we're only sitting at around a million records vs. the 500 million that were being discussed earlier. I can see where I'd run into memory problems with some of the stuff you're doing. On the other hand, perhaps they'd give me a more powerful computer to do it.
I tend to use outer joins when pulling from auxiliary tables. Like getting a description to go with an ID, or a category for a detail.
It usually returns the same results as an inner join. But if something is missing, I'd rather have the nulls than missing rows. Also, I think it clarifies the logic slightly.
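Roughly this pattern, with hypothetical table and column names:

```sql
-- LEFT JOIN keeps every detail row; a missing lookup entry
-- shows up as a NULL description instead of a vanished row.
SELECT d.detail_id,
       d.category_id,
       c.description          -- NULL if the ID has no entry
FROM detail d
LEFT JOIN category c ON c.category_id = d.category_id;
```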
If I were trying to get a description, I'd use a left join or right join (it's been a bit since I've done SQL coding… these are the same thing, no? Swap the order of the tables and the words "right" and "left" and I think it's the same).
I want to retain records for which there are no descriptions, but I couldn’t care less if there are descriptions with no records. If no one purchased an obscure rider last month I don’t care. And if no one purchased the product that we stopped selling 17 years ago then I definitely don’t care.
You're gonna do a full table scan for any condition (=, <>, IN, etc.) unless it's on an indexed column. Not sure what you have against NOT IN.
I've regularly run queries against a table with a billion rows and it's never been a problem. You may want to put all your conditions for the monster table in a subquery before joining, though.
Edit: it seems NOT EXISTS is more efficient for large datasets. I'll try it next time. MINUS (Oracle's version of EXCEPT) might also work.
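The usual rewrite, sketched with hypothetical tables. NOT EXISTS can be turned into an anti-join by the optimizer and, unlike NOT IN, isn't derailed when the subquery returns a NULL:

```sql
-- NOT IN: compares against the whole subquery result, and a single
-- NULL in excluded.policy_id makes the predicate return no rows.
SELECT p.policy_id
FROM policy p
WHERE p.policy_id NOT IN (SELECT e.policy_id FROM excluded e);

-- NOT EXISTS: a correlated probe per row; NULLs in the
-- subquery don't change the result.
SELECT p.policy_id
FROM policy p
WHERE NOT EXISTS (
    SELECT 1 FROM excluded e WHERE e.policy_id = p.policy_id
);
```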