Fastest way to check for no nulls in large data set

I’ve inherited a project, and one of my current charges is to validate data against production, as well as to validate that all the joins I’ve inherited are working properly.

I’m operating out of Snowflake, which is like my 4th or 5th SQL variant (woohoo!). I can also export search results to Excel if need be.

In SQL, you can run SELECT DISTINCT col_name1 to get the distinct values in col_name1, but I have 200+ columns, so I’m wondering: what’s the fastest way to view distinct values for all columns? So if my data is col1, col2, …, col200, my ideal output would be one table with col1 and, underneath the header, all distinct values from col1, then col2 and all distinct values from col2, …, concluding with col200 and all distinct values from col200.

Is there any easy way to write this? I’m not a very sophisticated programmer, so I wouldn’t know how to write a loop of some sort, or a do-while that increments n+1.

Otherwise I could be stuck authoring and reviewing 200+ statements: SELECT DISTINCT col1 FROM table, SELECT DISTINCT col2 FROM table, …

Any recommendations?

Till all are one,

Epistemus

You could potentially use dynamic SQL.

For various good reasons, most SQL dialects don’t let you turn column names (or table names) into variables/parameters. This is because SQL is designed for working with more or less permanent tables, and this behavior is safer. Unfortunately, it is less convenient for analysis work.
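That said, Snowflake does have an escape hatch: Snowflake Scripting plus EXECUTE IMMEDIATE. Here is a rough, untested sketch of the idea (MY_TABLE is a placeholder; run it as an anonymous block, e.g. in Snowsight):

DECLARE
  stmt STRING;
  res RESULTSET;
BEGIN
  -- Build one SELECT DISTINCT per column from the catalog and glue
  -- them together with UNION ALL. TO_VARCHAR keeps the value column
  -- type-compatible across columns of different types.
  SELECT LISTAGG(
           'SELECT DISTINCT ''' || COLUMN_NAME || ''' AS col_name, ' ||
           'TO_VARCHAR("' || COLUMN_NAME || '") AS col_value ' ||
           'FROM MY_TABLE',
           ' UNION ALL ')
    INTO :stmt
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_NAME = 'MY_TABLE';
  res := (EXECUTE IMMEDIATE :stmt);
  RETURN TABLE(res);
END;

That returns one long two-column table, column name next to each of its distinct values, which is a stacked version of the output you described.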

I don’t know how much programming you know. If you don’t know any other language, one option is to “generate” the SQL code in Excel with paste commands, and then paste it back into SQL as a query.

But 200 isn’t that bad, really. I bet you could copy and paste that inside 30 minutes.

Are you looking for how many values are not null? If so, would this work?
IS NOT NULL
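If it’s null counts you want, a single pass over the table can do it (untested sketch; my_table and the column names are made up). COUNT(col) skips nulls, so subtracting it from COUNT(*) gives the null count per column:

SELECT
    COUNT(*) - COUNT(col1) AS col1_nulls,
    COUNT(*) - COUNT(col2) AS col2_nulls
    -- ...repeat for the remaining columns
FROM my_table;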

Or are you looking for all distinct values in the column?

I suck at SQL.

I posed your query to ChatGPT. This was its response:

I don’t know how useful that is. You’re probably better off listening to the suggestions of the other posters ITT.

Regards,
‘###


The ChatGPT answer is pretty much what I’d do, but I’d also involve Excel.

You can get a list of fields in a particular table lots of different ways; I’m not sure which is best for Snowflake. I use MS’s SSMS to write things in T-SQL. Once you have the list, you can copy it into Excel and do things like prepend "SELECT DISTINCT " to each field name, then append ", '[Field_name_N]' AS FldName FROM table UNION", varying the value of [Field_name_N] for each row. The last row would not have UNION. Paste all these rows back into your SQL editor and run. This dumps all the distinct values, paired with the field they came from, into one two-column table, as sketched below.
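Roughly what the pasted rows would look like (my_table is made up; if the column types differ, you may need to wrap each field in TO_VARCHAR so the UNIONs line up):

SELECT DISTINCT TO_VARCHAR(col1) AS FieldValue, 'col1' AS FldName FROM my_table UNION
SELECT DISTINCT TO_VARCHAR(col2) AS FieldValue, 'col2' AS FldName FROM my_table UNION
-- ...one row per field, generated in Excel...
SELECT DISTINCT TO_VARCHAR(col200) AS FieldValue, 'col200' AS FldName FROM my_table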

I don’t normally use dynamic SQL, but here is an example of a similar task I do occasionally. If AXIS chokes in the midst of creating tables from grid output, you may end up with lots of leftover tables in your SQL DB. Here is the query I use to clean them up, slightly redacted. No Excel step for this one.

/* Query to get rid of intermediate AXIS output tables.

Use the first block of code to get the DROP commands for the list of tables whose names start with the desired value.

Examine the list to make sure no tables that are to be kept are in it.

Paste the results from the 1st block into the 2nd block, double-check, then execute those commands.

Note: this could be written as a single-step dynamic query, but since it is destructive by nature, it is safer to implement in two steps with inspection between them.
*/

-- block 1 - get DROP commands from the table list
SELECT 'DROP TABLE dbo.[' + TABLE_NAME + ']' FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_NAME LIKE 'AXIS_blahblah_0331_I_0%' OR TABLE_NAME LIKE 'AXIS_blahblah_0331_I_1%'
ORDER BY TABLE_NAME

-- block 2 - paste results here, then execute this block
DROP TABLE dbo.[AXIS_blahblah_0331_I_01_servername_run_id]

Any chance you can import your dataset into a Python DataFrame? More specifically, a pandas DataFrame?

Or even an R dataframe?

That would be a far more efficient means of finding out values for each column, IMO.

I don’t know what your use case is, but if you want no nulls, you can set the table constraints to not accept data with nulls.
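In Snowflake that would look something like this (untested; my_table and col1 are placeholders). NOT NULL is one of the few constraints Snowflake actually enforces:

-- Reject nulls in an existing column; the command should fail if the
-- column already contains nulls, which doubles as a quick check.
ALTER TABLE my_table ALTER COLUMN col1 SET NOT NULL;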

That is a good approach for the data. OP might also be interested in results of joins that come back null.
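A common way to check that (untested sketch; fact_table, dim_table, and dim_id are made up): a LEFT JOIN leaves the right-hand columns null wherever a row found no match, so counting those rows shows how many records the join is dropping:

SELECT COUNT(*) AS unmatched_rows
FROM fact_table f
LEFT JOIN dim_table d
    ON f.dim_id = d.dim_id
WHERE d.dim_id IS NULL;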