Paxata and Data Prep for Data Science
What is Data Preparation for Machine Learning?
Data preparation is the process of transforming raw data into a form that machine learning algorithms can use to uncover insights and make predictions.
Why is Data Preparation Important?
Most machine learning algorithms require data to be formatted in very specific ways, so your raw datasets generally need some preparation before they can yield useful insights. For example, some datasets have values that are missing or invalid. If a value is missing, the algorithm can't use it. And if a value is invalid, the algorithm produces less accurate or even misleading outcomes. Good data preparation produces clean, well-curated data that leads to more practical, accurate model outcomes.
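To make the idea of missing and invalid values concrete, here is a minimal sketch in Python with pandas. It is not Paxata's interface or API. The column names, the sample rows, and the 0 to 120 valid-age range are all assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical raw data: one missing age and one impossible (negative) age.
raw = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "age": [34, None, -5, 51],
})

def quality_report(df: pd.DataFrame) -> dict:
    """Count missing and out-of-range ages (the 0-120 range is an assumption)."""
    missing = int(df["age"].isna().sum())
    invalid = int(((df["age"] < 0) | (df["age"] > 120)).sum())
    return {"missing": missing, "invalid": invalid}

report = quality_report(raw)
print(report)
```

A quick count like this shows how much of a dataset needs attention before it is used for training.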
So what can I do to prep my data?
Paxata provides the transformation tools you need to clean, normalize, and shape your data. And once you've cleaned your data, Paxata also provides the tools you need to prepare your features for feature engineering.
Here's a short list of ways Paxata can help you quickly prep your data to train your ML models:
- join datasets for feature enrichment
- add target variables to a training dataset
- normalize messy categorical variables (for example, NY versus New York or CA versus California)
- select only specific variables to save in a training dataset
- remove unwanted observations
- format dates so they are recognized by the training model
- identify and redress missing or incomplete values
- bin ranges into categories
- compute windowed aggregates (for example, rolling up sales transactions to a daily level)
- profile a dataset to understand your data prior to prep
- perform exploratory data analysis
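
For readers who want to see what a few of the operations in the list above amount to, here is a rough sketch in Python with pandas. It does not show how Paxata performs these steps. The data, column names, mapping table, bin edges, and the choice of median imputation are all assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical transaction data; all column names and values are made up.
sales = pd.DataFrame({
    "state": ["NY", "New York", "CA", "California", "ny "],
    "amount": [120.0, 80.0, None, 45.0, 300.0],
    "sold_at": ["2021-03-01 09:15", "2021-03-01 17:40", "2021-03-02 11:05",
                "2021-03-02 13:30", "2021-03-03 08:00"],
})

# Normalize messy categorical values to one spelling per category.
state_map = {"ny": "NY", "new york": "NY", "ca": "CA", "california": "CA"}
sales["state"] = sales["state"].str.strip().str.lower().map(state_map)

# Format dates so downstream tools see real timestamps, not strings.
sales["sold_at"] = pd.to_datetime(sales["sold_at"], format="%Y-%m-%d %H:%M")

# Redress missing values; filling with the median is one option among many.
sales["amount"] = sales["amount"].fillna(sales["amount"].median())

# Bin numeric ranges into categories (the bin edges are arbitrary assumptions).
sales["size"] = pd.cut(
    sales["amount"],
    bins=[0, 100, 200, float("inf")],
    labels=["small", "medium", "large"],
)

# Roll individual transactions up to a daily level.
daily = sales.groupby(sales["sold_at"].dt.date)["amount"].sum()
print(daily)
```

Each step maps to one bullet: normalizing categorical variables, formatting dates, redressing missing values, binning ranges, and computing a daily aggregate.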