Data Profiling Properties

The MetaKarta repository, API and UI support the following profiling details (row counts):

o Count: number of rows actually profiled, which is either the total number of rows in the source or the limit set when defining the harvesting options

o Null: rows which are null

o Distinct rows: non-distinct = total - distinct - empty. For example, when there are one million rows and the column has far fewer (e.g. 10) distinct values, the data is considered to be distinct.

o Duplicate rows: rows with identical values for this field

o Valid rows: rows with valid contents for this field

o Empty rows: null in database or empty in files

o Invalid rows: rows without valid contents for this field

Whether a row counts as valid or invalid depends on the data type that has been autodetected for the column. For example, if the first column was identified as an INTEGER data type but the last record contains the value “a”, which is not a valid INTEGER, that row contributes to the invalid counter.
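The row counts described above can be illustrated with a minimal Python sketch. This is not MetaKarta's implementation: the function names, the use of None and "" for empty rows, and the reading of "distinct" as the number of distinct non-empty values are all assumptions made for illustration. The column is assumed to have been autodetected as INTEGER.

```python
def is_valid_integer(value):
    """Return True when the text parses as an integer (hypothetical validity rule)."""
    try:
        int(value)
        return True
    except ValueError:
        return False


def profile_counts(values, limit=None):
    """Compute row counts for a column assumed to be autodetected as INTEGER."""
    # Profile either every row or only the first `limit` rows.
    rows = values if limit is None else values[:limit]
    # Empty rows: NULL in a database (None here) or an empty string in a file.
    non_empty = [v for v in rows if v is not None and v != ""]
    empty = len(rows) - len(non_empty)
    valid = sum(1 for v in non_empty if is_valid_integer(v))
    # Assumption: "distinct" is the number of distinct non-empty values.
    distinct = len(set(non_empty))
    return {
        "count": len(rows),
        "empty": empty,
        "valid": valid,
        "invalid": len(non_empty) - valid,
        # Formula from the text: non-distinct = total - distinct - empty.
        "non_distinct": len(rows) - distinct - empty,
        "distinct": distinct,
    }


column = ["1", "2", "2", "a", None, ""]
result = profile_counts(column)
```

For this sample column, the value "a" is the only non-empty entry that fails the INTEGER check, so it is the single row added to the invalid counter.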

o Average length: average of the lengths of each value profiled

o Min length: lowest of the lengths of each value profiled

o Max length: highest of the lengths of each value profiled

o Min value: lowest value

o Max value: highest value

o Values [value, rows]: distribution of values and their frequency

o Patterns [pattern, rows]: list of different patterns of data presentation discovered in the source and frequency

o Data Types [type, rows]: list of data type matches and their frequency. This is the column data type detected by the profiler. When a column has data of different data types, the profiler picks the most used one. You can overwrite the value manually. The value could contradict the data type declared by the database. For example, when a VARCHAR database column contains only date values, the profiler sets the DATE data type. Here is the list of supported types:

- Text

- Date

- Time

- DateTime

- Geographical

- No Percentiles

- Means, Median

- Variance

- Std. Deviation

- Number
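The "most used data type" rule can be sketched as a simple majority vote. The sketch below is an assumption about how such detection might work, not the profiler's actual logic: the function names, the date and time formats, and the fallback to Text are hypothetical, and only a subset of the supported types is covered.

```python
from collections import Counter
from datetime import datetime

# Hypothetical formats; a real profiler would likely recognize many more.
FORMATS = (
    ("Date", "%Y-%m-%d"),
    ("Time", "%H:%M:%S"),
    ("DateTime", "%Y-%m-%d %H:%M:%S"),
)


def detect_type(value):
    """Classify a single text value as Number, Date, Time, DateTime or Text."""
    try:
        float(value)
        return "Number"
    except ValueError:
        pass
    for label, fmt in FORMATS:
        try:
            datetime.strptime(value, fmt)
            return label
        except ValueError:
            continue
    return "Text"


def infer_column_type(values):
    """Return (most used type, per-type row counts) for a list of text values."""
    counts = Counter(detect_type(v) for v in values)
    if not counts:
        return None, counts
    return counts.most_common(1)[0][0], counts


# A VARCHAR column that mostly holds dates.
inferred, type_rows = infer_column_type(
    ["2024-01-15", "2024-02-01", "2023-12-31", "n/a"]
)
```

Here three of the four values match the assumed date format, so the inferred type is Date even though the declared database type would be VARCHAR, and the one non-matching value is counted under Text.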

o Inferred Data Type: the data type inferred after data profiling the object.

o Data classes: list of data classes matched and likelihood as a percentage.
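The length, value and pattern properties listed above can also be sketched in a few lines of Python. The pattern notation used here (9 for a digit, A for a letter, other characters kept as-is) is an assumption chosen for illustration and may differ from the notation the product displays; the function names are hypothetical, and values are compared as text.

```python
from collections import Counter


def to_pattern(value):
    """Map digits to '9' and letters to 'A', keeping other characters (assumed notation)."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)


def profile_values(values):
    """Compute length statistics, min/max values, and value/pattern frequencies."""
    lengths = [len(v) for v in values]
    return {
        "avg_length": sum(lengths) / len(lengths),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "min_value": min(values),  # text comparison
        "max_value": max(values),
        "values": Counter(values),  # [value, rows]
        "patterns": Counter(to_pattern(v) for v in values),  # [pattern, rows]
    }


stats = profile_values(["AB-12", "CD-34", "AB-12", "X9"])
```

In this sample, three of the four rows share the pattern AA-99, which is the kind of frequency information the Patterns property reports.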