Data Profiling Properties

The MetaKarta repository, API and UI support the following profiling details (row counts):

Count:  number of rows actually profiled, which is either the total number of rows in the source or the limit set when defining the harvesting options

Null:  rows which are null

Distinct rows:  non-distinct = total - distinct - empty. For example, when there are one million rows and the column has far fewer (e.g. 10) distinct values, the data is considered to be distinct.

Duplicate rows:  rows with identical values for this field

Valid rows:  rows with valid contents for this field

Empty rows:  rows that are null in a database or empty in files

Invalid rows: rows without valid contents for this field

The valid/invalid classification depends on the data type that has been autodetected for the column. For example, if the first column was identified as an INTEGER data type but the last record contains the value “a”, which is not a valid INTEGER, that record contributes to the invalid counter.
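As a rough illustration of how the row counters above relate to one another, the following Python sketch computes them for a single column autodetected as INTEGER. The function names and the validity rule are hypothetical and are not part of the MetaKarta API. The sketch assumes that "distinct" counts the distinct non-empty values and that empty rows are counted as neither valid nor invalid; the product may define these differently.

```python
from typing import Optional, Sequence


def is_valid_integer(value: str) -> bool:
    """Return True when the text parses as an integer (hypothetical validity rule)."""
    try:
        int(value)
        return True
    except ValueError:
        return False


def row_counts(values: Sequence[Optional[str]]) -> dict:
    """Compute illustrative row counters for one column detected as INTEGER."""
    total = len(values)
    # Assumption: None (database NULL) and "" (empty file field) both count as empty.
    non_empty = [v for v in values if v is not None and v != ""]
    empty = total - len(non_empty)
    # Assumption: distinct means the number of distinct non-empty values.
    distinct = len(set(non_empty))
    valid = sum(1 for v in non_empty if is_valid_integer(v))
    return {
        "count": total,
        "empty": empty,
        "distinct": distinct,
        # Formula from the text: non-distinct = total - distinct - empty
        "non_distinct": total - distinct - empty,
        "valid": valid,
        # Assumption: only non-empty rows can be invalid.
        "invalid": len(non_empty) - valid,
    }


sample = ["1", "2", "2", None, "", "a"]
counts = row_counts(sample)
print(counts)
```

In this sample the value "a" is the only row that contributes to the invalid counter, and the repeated "2" accounts for the single non-distinct row.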

Average length:  average length of the values profiled

Min length:  shortest length among the values profiled

Max length:  longest length among the values profiled

Min value:  lowest value

Max value:  highest value

Values [value, rows]:  distribution of values and their frequency

Patterns [pattern, rows]:  list of the distinct data presentation patterns discovered in the source and their frequency
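The length, value, and pattern statistics above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the pattern notation (9 for a digit, A for a letter, other characters kept as they are) and the names to_pattern and column_stats are hypothetical, and the actual pattern syntax used by the profiler may differ.

```python
from collections import Counter
from statistics import mean


def to_pattern(value: str) -> str:
    """Map each character to a class symbol: digit -> 9, letter -> A, others kept."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)


def column_stats(values):
    """Compute illustrative length, value, and pattern statistics for text values."""
    lengths = [len(v) for v in values]
    return {
        "avg_length": mean(lengths),
        "min_length": min(lengths),
        "max_length": max(lengths),
        # Values are compared as strings here; a typed column would compare typed values.
        "min_value": min(values),
        "max_value": max(values),
        # [value, rows] pairs, most frequent first
        "values": Counter(values).most_common(),
        # [pattern, rows] pairs, most frequent first
        "patterns": Counter(to_pattern(v) for v in values).most_common(),
    }


sample = ["AB-12", "CD-34", "AB-12", "X9"]
stats = column_stats(sample)
print(stats["values"])
print(stats["patterns"])
```

Three of the four sample values share the pattern AA-99, so a pattern distribution of this kind makes the outlier X9 easy to spot.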

Data Types [type, rows]:  list of data type matches and their frequency. This is the column data type detected by the profiler. When a column has data of different data types, the profiler picks the most used one. You can overwrite the value manually. The value could contradict the data type declared by the database. For example, when a VARCHAR database column contains only date values, the profiler sets the DATE data type. Here is the list of supported types:

-       Text

-       Date

-       Time

-       DateTime

-       Geographical

-       No Percentiles

-       Means, Median

-       Variance

-       Std. Deviation

-       Number
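The "most used type wins" behavior described above can be sketched as follows. This is a minimal Python illustration, not the profiler's implementation: the function names detect_type and infer_column_type are hypothetical, the date and time formats tried are assumptions, and Geographical detection is left out.

```python
from collections import Counter
from datetime import datetime


def detect_type(value: str) -> str:
    """Classify one value; the formats tried here are illustrative assumptions."""
    try:
        float(value)
        return "Number"
    except ValueError:
        pass
    for type_name, fmt in (
        ("DateTime", "%Y-%m-%d %H:%M:%S"),
        ("Date", "%Y-%m-%d"),
        ("Time", "%H:%M:%S"),
    ):
        try:
            datetime.strptime(value, fmt)
            return type_name
        except ValueError:
            continue
    # Anything that matches no other type falls back to Text.
    return "Text"


def infer_column_type(values):
    """Return the [type, rows] distribution and the most frequent type."""
    distribution = Counter(detect_type(v) for v in values)
    # On a tie, Counter.most_common keeps the type that was encountered first.
    inferred, _ = distribution.most_common(1)[0]
    return distribution.most_common(), inferred


# A column declared as VARCHAR in the database that mostly holds date values.
varchar_column = ["2024-01-15", "2024-02-01", "2024-03-20", "n/a"]
distribution, inferred = infer_column_type(varchar_column)
print(distribution)
print(inferred)
```

Here the inferred type is Date even though the declared type is VARCHAR, which mirrors the contradiction between detected and declared types mentioned above.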

Inferred Data Type:  the data type inferred after profiling the object.

Data classes:  list of the data classes matched and the likelihood of each match as a percentage.