Common data preparation tasks in SQL
- Eliminating duplicates:
- Identify the columns that should be used to determine whether a row is a duplicate.
- Use the
DISTINCT
keyword to return only unique rows, or use theGROUP BY
clause to group rows by the selected columns and use an aggregate function likeCOUNT()
to identify duplicate rows. - Use the
DELETE
statement to remove duplicate rows from the table.
Example SQL code to eliminate duplicates based on a single column:
SELECT DISTINCT column_name
FROM table_name;
Example SQL code to eliminate duplicates based on multiple columns:
SELECT DISTINCT column_name_1, column_name_2
FROM table_name;
- Ensuring data format consistency:
- Identify the columns that should have consistent data formats.
- Use the
CAST()
function to convert data types to the desired format. - Use the
SUBSTR()
function to extract a substring from a column if needed. - Use the
CONCAT()
function to combine columns if needed.
Example SQL code to convert a string column to an integer column:
SELECT CAST(column_name AS INTEGER)
FROM table_name;
Example SQL code to extract a substring from a string column:
SELECT SUBSTR(column_name, start_index, length)
FROM table_name;
Example SQL code to concatenate two columns:
SELECT CONCAT(column_name_1, ' ', column_name_2)
FROM table_name;
- Validating data values:
- Identify the columns that should have valid data values.
- Use the
WHERE
clause to filter out rows with invalid data values. - Use the
LIKE
operator to filter out rows with data values that do not match a specific pattern. - Use the
IN
operator to filter out rows with data values that are not in a specific list.
Example SQL code to filter out rows with invalid data values:
SELECT column_name
FROM table_name
WHERE column_name > 0;
Example SQL code to filter out rows with data values that do not match a specific pattern:
SELECT column_name
FROM table_name
WHERE column_name LIKE 'A%';
Example SQL code to filter out rows with data values that are not in a specific list:
SELECT column_name
FROM table_name
WHERE column_name IN ('A', 'B', 'C');
By using these SQL code snippets and checklist, you can perform common data preparation tasks to ensure that your data is clean and ready for analysis.