My categorical variables are messy and inconsistent. How do I normalize them?
Often, you'll find that the values of your categorical variables are messy and inconsistent, for example "NY" versus "New York". But there's an easy way to clean this up using our Cluster and Edit algorithms. We'll walk through a quick example here of how to do this in the app. For a complete understanding of our clustering algorithms, please refer to our documentation for the Cluster and Edit feature.
Suppose you receive a training dataset with a variable for state. Unfortunately, the values for the variable are inconsistent: "NewYorke", "Neu York", "CAlifornia", "California", and so on.
Normalizing this column is super simple with the Cluster and Edit feature. From the variable's column, click the drop-down arrow and select "cluster + edit".
The Cluster+Edit menu opens. Notice that, by default, "metaphone" is the algorithm used for clustering. It is powerful for data with misspellings, where words sound similar but are spelled differently. However, for this example, let's switch to the "fingerprint" algorithm. It's the most restrictive of the algorithms and a good place to start, because it only groups values into a cluster when the differences between them are limited to punctuation, word order, and capitalization.
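To make the idea of fingerprint clustering more concrete, here is a minimal Python sketch of a key that ignores punctuation, word order, and capitalization. It illustrates the general technique only and is not the app's actual implementation; the function names `fingerprint` and `cluster_by_fingerprint` and the sample values are assumptions made up for this example.

```python
import string
from collections import defaultdict

# Translation table that deletes ASCII punctuation characters.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def fingerprint(value):
    """Build a key that ignores capitalization, punctuation, and word order."""
    cleaned = value.strip().lower().translate(_PUNCT_TABLE)
    tokens = sorted(cleaned.split())  # sorting removes word-order differences
    return " ".join(tokens)


def cluster_by_fingerprint(values):
    """Group values by key; keep only groups with more than one distinct spelling."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return {key: members for key, members in groups.items()
            if len(set(members)) > 1}


# Hypothetical sample column (illustrative data only).
states = ["New York", "new york", "York, New", "NEW YORK.",
          "CAlifornia", "California", "Neu York", "NewYorke"]
clusters = cluster_by_fingerprint(states)
for key, members in clusters.items():
    print(key, members)
```

In this sketch, "Neu York" and "NewYorke" do not land in any cluster because they are true misspellings rather than formatting variants. That is the kind of leftover a sound-based algorithm such as metaphone is meant to catch.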
ProTip: If you're still exploring your data, start with the "fingerprint" algorithm and make your edits. Then switch to the "metaphone" algorithm and cluster once more to see what other potential clusters exist in the data.
Now just enter the desired normalized values and click the green Save button.
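Conceptually, saving your edits replaces each messy value with the normalized value you entered. The Python sketch below shows that idea as a simple lookup; the helper `apply_replacements` and the mapping are hypothetical and only illustrate the step, not how the app stores or applies edits.

```python
def apply_replacements(values, replacements):
    """Return a new list with each value swapped for its chosen normalized form."""
    # Values without an entry in the mapping are left unchanged.
    return [replacements.get(value, value) for value in values]


# Hypothetical edits, as might be entered for each clustered value.
replacements = {
    "NewYorke": "New York",
    "Neu York": "New York",
    "CAlifornia": "California",
}

states = ["NewYorke", "Neu York", "CAlifornia", "California"]
normalized = apply_replacements(states, replacements)
print(normalized)
```

The original list is left untouched, so the raw and normalized columns can be compared side by side.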
You're done! And if you want to sanity check your work to see what else may be lurking in a variable's column, just open a Filtergram for that column.
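If you ever need a similar sanity check outside the app, counting how often each distinct value occurs is one rough way to spot stragglers. The sketch below uses Python's standard `collections.Counter`; treating this as comparable to the sanity check described above is an assumption, and the function name `value_counts` and the sample data are made up for illustration.

```python
from collections import Counter


def value_counts(values):
    """Return (value, count) pairs, most frequent first."""
    return Counter(values).most_common()


# Hypothetical column after a first round of edits (illustrative data only).
column = ["New York", "New York", "California",
          "Neu York", "California", "New York"]
counts = value_counts(column)
for value, count in counts:
    print(value, count)
```

Rare values at the bottom of the list, such as "Neu York" here, are usually the ones worth a second look.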