Course Description
Data Preparation and Processing focuses on the critical steps involved in preparing raw data for analysis and ensuring its quality and usability. The course covers techniques for cleaning, transforming, and organizing data, including handling missing values, outlier detection, normalization, and feature engineering. Students will explore methods for working with various data types and formats, such as structured, unstructured, and semi-structured data. Additionally, the course emphasizes the importance of data preprocessing for ensuring accurate and reliable results in data analysis and machine learning applications. By the end of the course, students will be able to effectively prepare datasets for analysis in real-world scenarios. (3 credits)
Prerequisite
- DAT 201: Principles of Data Science
Student Learning Outcomes (SLOs)
Students who successfully complete this course will be able to:
1. Identify common data quality issues, including missing values, outliers, and inconsistencies, and explain their potential impact on data analysis and machine learning outcomes.
2. Apply data cleaning techniques to handle missing values, correct data inconsistencies, and manage outliers, ensuring data accuracy and reliability.
3. Transform raw data using normalization, scaling, and encoding techniques, preparing datasets for machine learning algorithms and advanced analysis.
4. Perform feature engineering, including feature selection and extraction, to enhance model performance and improve the interpretability of data insights.
5. Distinguish between structured, unstructured, and semi-structured data, demonstrating effective methods for processing and organizing each type for analysis.
6. Prepare datasets for real-world analytical tasks by implementing a complete data preprocessing pipeline, from data cleaning through feature engineering, that supports accurate analysis and predictive modeling.
Course Activities and Grading
Assignments | Weight |
---|---|
Discussions (Weeks 1-7) | 10% |
Homework Assignments (Weeks 1-3 and 5-7) | 40% |
Midterm Project (Week 4) | 17% |
Final Project: Dataset Selection (Week 8) | 8% |
Final Project: Jupyter Notebook (Week 8) | 25% |
Total | 100% |
Required Textbook
This course uses Open Educational Resources (OER). OER are openly licensed educational resources that can be used for teaching, learning, and research. They may include textbooks, videos, and software, and are available at no cost to students.
Course Schedule
Week | SLOs | Readings and Exercises | Assignments |
---|---|---|---|
1 | 1,2 | Topic: Data Cleaning and Imputation | |
2 | 4 | Topic: Mutual Information and Feature Selection | |
3 | 3 | Topic: Feature Extraction, Feature Scaling, Encoding, and Binning | |
4 | 1,2,3,4 | Topic: Midterm Project | |
5 | 4 | Topic: Clustering | |
6 | 4 | Topic: Principal Component Analysis | |
7 | 5,6 | Topic: Data Types & Preprocessing Pipelines | |
8 | 1,2,3,4,5,6 | Topic: Final Project | |
COSC Accessibility Statement
Charter Oak State College encourages students with disabilities, including non-visible disabilities such as chronic diseases, learning disabilities, head injuries, attention-deficit/hyperactivity disorder, or psychiatric disabilities, to discuss appropriate accommodations with the Office of Accessibility Services at OAS@charteroak.edu.
COSC Policies, Course Policies, Academic Support Services and Resources
Students are responsible for knowing all Charter Oak State College (COSC) institutional policies, course-specific policies, procedures, and available academic support services and resources. Please see COSC Policies for COSC institutional policies, and see also specific policies related to this course. See COSC Resources for information regarding available academic support services and resources.