Fastest way to check for no nulls in large data set

I’ve inherited a project, and one of my current charges is to validate data against production, as well as to validate that all the joins I’ve inherited are working properly.

I’m operating out of Snowflake, which is like my 4th or 5th SQL variant (woohoo!). I can also export search results to Excel if need be.

In SQL, you can run SELECT DISTINCT col_name1 to get the distinct values in col_name1, but I have 200+ columns, so I’m wondering: what’s the fastest way to view distinct values for all columns? So if my data is col1, col2, …, col200, my ideal output would be one table with col1 and, underneath the header, all distinct values from col1, then col2 and all distinct values from col2, …, concluding with col200 and all distinct values from col200.

Is there any easy way to write this? I’m not a very sophisticated programmer, so I wouldn’t know how to write a loop of some sort, or a do-while that increments n+1.

Otherwise I could be stuck authoring and reviewing 200+ statements: SELECT DISTINCT col1 FROM table, SELECT DISTINCT col2 FROM table, …

Any recommendations?

Till all are one,

Epistemus

You could potentially use dynamic SQL.

For various good reasons, most SQL dialects don’t let you turn column names (or table names) into variables/parameters. This is because SQL is designed for working with more or less permanent tables, and this behavior is safer. Unfortunately, it is less convenient for analysis work.
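That said, Snowflake does have an escape hatch: Snowflake Scripting plus EXECUTE IMMEDIATE. Here is a rough, untested sketch of the idea (MY_TABLE is a placeholder; run it as an anonymous block, e.g. in Snowsight):

DECLARE
  stmt STRING;
  res RESULTSET;
BEGIN
  -- Build one SELECT DISTINCT per column from the catalog and glue
  -- them together with UNION ALL. TO_VARCHAR keeps the value column
  -- type-compatible across columns of different types.
  SELECT LISTAGG(
           'SELECT DISTINCT ''' || COLUMN_NAME || ''' AS col_name, ' ||
           'TO_VARCHAR("' || COLUMN_NAME || '") AS col_value ' ||
           'FROM MY_TABLE',
           ' UNION ALL ')
    INTO :stmt
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_NAME = 'MY_TABLE';
  res := (EXECUTE IMMEDIATE :stmt);
  RETURN TABLE(res);
END;

That returns one long two-column table, column name next to each of its distinct values, which is a stacked version of the output you described.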

I don’t know how much programming you know. If you don’t know any other language, one option is to “generate” the SQL code in Excel with paste commands, and then paste it back into SQL as a query.

But 200 isn’t that bad, really. I bet you could copy and paste that inside 30 minutes.

Are you looking for how many values are not null? If so, would this work?
IS NOT NULL
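If it’s null counts you want, a single pass over the table can do it (untested sketch; my_table and the column names are made up). COUNT(col) skips nulls, so subtracting it from COUNT(*) gives the null count per column:

SELECT
    COUNT(*) - COUNT(col1) AS col1_nulls,
    COUNT(*) - COUNT(col2) AS col2_nulls
    -- ...repeat for the remaining columns
FROM my_table;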

Or are you looking for all distinct values in the column?

I suck at SQL.

I posed your query to ChatGPT. This was its response:

I don’t know how useful that is. You’re probably better off listening to the suggestions of the other posters ITT.

Regards,
‘###


The ChatGPT answer is pretty much what I’d do, but I’d also involve Excel.

You can get a list of fields in a particular table lots of different ways; I’m not sure which is best for Snowflake. I use MS’s SSMS to write things in T-SQL. Once you have the list, you can copy it into Excel and do things like prepend "SELECT DISTINCT " to each field name, then append ", '[Field_name_N]' AS FldName FROM table UNION", varying the value of [Field_name_N] for each row. The last row would not have UNION. Paste all these rows back into your SQL editor and run. This dumps all the distinct values, paired with the field they came from, into one two-column table, as sketched below.
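Roughly what the pasted rows would look like (my_table is made up; if the column types differ, you may need to wrap each field in TO_VARCHAR so the UNIONs line up):

SELECT DISTINCT TO_VARCHAR(col1) AS FieldValue, 'col1' AS FldName FROM my_table UNION
SELECT DISTINCT TO_VARCHAR(col2) AS FieldValue, 'col2' AS FldName FROM my_table UNION
-- ...one row per field, generated in Excel...
SELECT DISTINCT TO_VARCHAR(col200) AS FieldValue, 'col200' AS FldName FROM my_table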

I don’t normally use dynamic SQL, but here is an example of a similar task I do occasionally. If AXIS chokes in the midst of creating tables from grid output, you may end up with lots of leftover tables in your SQL DB. Here is the query I use to clean them up, slightly redacted. No Excel step for this one.

/* Query to get rid of intermediate AXIS output tables.

Use the first block of code to get the DROP commands for the list of tables whose names start with the desired value.

Examine the list to make sure no tables that are to be kept are in it.

Paste the results from the 1st block into the 2nd block, double-check, then execute those commands.

Note: this could be written as a single-step dynamic query, but since it is destructive by nature, it is safer to implement in two steps with inspection between them.
*/

-- block 1 - get DROP commands from the table list
SELECT 'DROP TABLE dbo.[' + TABLE_NAME + ']' FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_NAME LIKE 'AXIS_blahblah_0331_I_0%' OR TABLE_NAME LIKE 'AXIS_blahblah_0331_I_1%'
ORDER BY TABLE_NAME

-- block 2 - paste results here, then execute this block
DROP TABLE dbo.[AXIS_blahblah_0331_I_01_servername_run_id]

Any chance you can import your dataset into a Python DataFrame? More specifically, a pandas DataFrame?

Or even an R dataframe?

That would be a far more efficient means of finding out values for each column, IMO.

I don’t know what your use case is, but if you want no nulls, you can set the table constraints to not accept data with nulls.
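In Snowflake that would look something like this (untested; my_table and col1 are placeholders). NOT NULL is one of the few constraints Snowflake actually enforces:

-- Reject nulls in an existing column; the command should fail if the
-- column already contains nulls, which doubles as a quick check.
ALTER TABLE my_table ALTER COLUMN col1 SET NOT NULL;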

That is a good approach for the data. OP might also be interested in results of joins that come back null.
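A common way to check that (untested sketch; fact_table, dim_table, and dim_id are made up): a LEFT JOIN leaves the right-hand columns null wherever a row found no match, so counting those rows shows how many records the join is dropping:

SELECT COUNT(*) AS unmatched_rows
FROM fact_table f
LEFT JOIN dim_table d
    ON f.dim_id = d.dim_id
WHERE d.dim_id IS NULL;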