Visit the official Paxata Documentation portal for all of your doc needs.

Criterion for detecting a "Date" data-type in the Profiler

What is the criteria for Profiler to detect dates in data sources?
I have a plain CSV file and one of the columns has "date" inside it as show below.
The profiler does not detect this as date at all. 0% is reported under date data-type.

For e.g.
"store-id","date","totalsales"
"ABC12","1/1/2018","20321"
"ABC12","1/2/2018","$39403"
"CDB90","1/1/2018","3,034"
"UID10","1/1/2018","29302"
"OSS23","1/1/2018","20393"

Answers

  • Hello MagmaMan,

    The profiling is detected based on the type that's coming in from the source.

    In your case, the data  is in a CSV format, which is all text and hence the result reported in the profile is not a date type. 

    However it's possible to detect the date format through pattern matching.

    Thank You for your question,

    Akshay


  • @Akshay
    Well, the Sales is identified as Numeric.... CSV does not mean everything has to be Text. The explanation is not convincing....
  • @MagmaMan

    When importing datasets, we need to consider the source of the data when discussing column data types: 
    • Some formats and systems provide column type metadata.  If the system provides the metadata, we use this to identify column types.  Example: databases, Parquet files, Excel files
    • Some formats are basically text files. For these formats, When Paxata imports the data, we look at cell contents to identify the predominant type of data.  Example: CSV, TSV, Positional format, XML, JSON
    • Some parsers have options that instruct Paxata to attempt to interpret column data types during import for textual datasets. 
    Looking a back at your example:
    You likely had the "Parse cells into numbers, dates, and booleans" option enabled on the CSV during import


    This configuration leads to the following: 
    • "store-id": cells contain alphanumeric data, therefore we specify the type as text
    • "date": There are many formats for dates, so identification by looking at cell contents is far too imprecise. Cells contains numbers and non-numeric "/", therefore we specify the type as text. 
    • "totalsales": cells contain numbers and characters that are typically present in some numeric formats ('$', ','), therefore we specify the type as numeric. 
    If you disable the "Parse cells into numbers, dates, and booleans" option, each column type will be defined as text. 

    If you open the CSV in Excel, set the column type to Date and import the resulting Excel file, the "date" column type will be set from the Excel type. 

    Thanks,
    Bill
Sign In or Register to comment.