Common data preparation tasks in SQL

- Eliminating duplicates:
- Identify the columns that should be used to determine whether a row is a duplicate.
- Use the
DISTINCTkeyword to return only unique rows, or use theGROUP BYclause to group rows by the selected columns and use an aggregate function likeCOUNT()to identify duplicate rows. - Use the
DELETEstatement to remove duplicate rows from the table.
Example SQL code to eliminate duplicates based on a single column:
SELECT DISTINCT column_name
FROM table_name;
Example SQL code to eliminate duplicates based on multiple columns:
SELECT DISTINCT column_name_1, column_name_2
FROM table_name;
- Ensuring data format consistency:
- Identify the columns that should have consistent data formats.
- Use the
CAST()function to convert data types to the desired format. - Use the
SUBSTR()function to extract a substring from a column if needed. - Use the
CONCAT()function to combine columns if needed.
Example SQL code to convert a string column to an integer column:
SELECT CAST(column_name AS INTEGER)
FROM table_name;
Example SQL code to extract a substring from a string column:
SELECT SUBSTR(column_name, start_index, length)
FROM table_name;
Example SQL code to concatenate two columns:
SELECT CONCAT(column_name_1, ' ', column_name_2)
FROM table_name;
- Validating data values:
- Identify the columns that should have valid data values.
- Use the
WHEREclause to filter out rows with invalid data values. - Use the
LIKEoperator to filter out rows with data values that do not match a specific pattern. - Use the
INoperator to filter out rows with data values that are not in a specific list.
Example SQL code to filter out rows with invalid data values:
SELECT column_name
FROM table_name
WHERE column_name > 0;
Example SQL code to filter out rows with data values that do not match a specific pattern:
SELECT column_name
FROM table_name
WHERE column_name LIKE 'A%';
Example SQL code to filter out rows with data values that are not in a specific list:
SELECT column_name
FROM table_name
WHERE column_name IN ('A', 'B', 'C');
By using these SQL code snippets and checklist, you can perform common data preparation tasks to ensure that your data is clean and ready for analysis.
