Data Quality CSV Format

As the csv file contents are derived from different aspect of the source tool(s) and  can be using different methodologies to produce this information, they cannot be directly correlated across these different aspects, including:

-       Data Quality Rules (e.g., Rule Application.csv)

-       Data Sampling (e.g.,Data Sampling.csv)

-       Profiling (e.g., Tables.csv)

 

Rules (folder with Data Quality information)

-       Folder.csv (allows to recreate an hierarchy of folders in order to organize DQ rules)

·  Id (folder unique identifier)

·  Name (folder name)

·  Parent Folder Id (identifier of the parent folder, optional)

·  Description (folder description, optional)

-       Rule.csv (provides DQ rule declarations)

·  Id (rule unique identifier)

·  Name (rule name. Name must be unique under the parent folder.)

·  Folder Id (identifier of the parent folder)

·  Description (rule description, optional)

·  Operation (rule operation, optional)

·  Dimension (rule dimension name. See the list of the possible values below)

·  Url (hyperlink on the rule in the original DQ tool, optional)

-       Rule Application.csv (allows to specify DQ rule instances and associate them with the individual columns)

·  Id (rule application unique identifier)

·  Rule Id (identifier of the associated DQ rule)

·  Type (column path, connection type. Is needed to interpret and then correctly stitch the referenced column in MM. See the list of the possible values below.)

·  Server (column path, server name. Database host:port, Snowflake warehouse, file service, like gs://bucket)

·  Catalog (column path, catalog name. It can be a Snowflake or SQL Server database. Empty for Oracle and File systems.)

·  Schema (column path, schema name. It can be a database schema, file folder path, like /dir1/dir2 )

·  Table (column path, table name. It can also be a file, or partitioned directory name)

·  Column (column path, column name)

·  Description (rule application description, optional)

·  Threshold (rule application threshold. It is a floating point value that is used to compute DQ Status in MM (Good, Acceptable or Bad). Use comma as a delimiter if more than one value is required)

-       Rule Application Column.csv (provides additional associations between DQ Rule Applications and the corresponding columns if Rule Application refers to two or more columns)

·  Rule Application Id (rule application identifier)

·  Type (column path, connection type. Is needed to interpret and then correctly stitch the referenced column in MM. See the list of the possible values below.)

·  Server (column path, server name. Database host:port, Snowflake warehouse, file service, like gs://bucket)

·  Catalog (column path, catalog name. It can be a Snowflake or SQL Server database. Empty for Oracle and File systems.)

·  Schema (column path, schema name. It can be a database schema, file folder path, like /dir1/dir2 )

·  Table (column path, table name. It can also be a file, or partitioned directory name)

·  Column (column path, column name)

-       Rule Measure.csv (provides rule execution results)

·  Id (rule measure unique identifier)

·  Rule Application Id (rule application identifier)

·  Date (the execution date when the values have been collected. See below the supported date time format)

·  Rows Total (the overall number of the processed rows)

·  Rows Failed (the number of rows that failed the DQ rule)

Profiling (folder with Data Profiling information, optional)

-       Table.csv (table level profiling information)

·  Type (table path, connection type. Is needed to interpret and then correctly stitch the referenced table in MM. See the list of the possible values below.)

·  Server (table path, server name. Database host:port, Snowflake warehouse, file service, like gs://bucket)

·  Catalog (table path, catalog name. It can be a Snowflake or SQL Server database. Empty for Oracle and File systems.)

·  Schema (table path, schema name. It can be a database schema, file folder path, like /dir1/dir2 )

·  Table (table path, table name. It can also be a file, or partitioned directory name)

·  Creation Date (the date/time when data profiling information has been collected. See below the supported date time format)

·  Data Update Date (the last date/time when data has been updated in the source table. See below the supported date time format)

·  Records All (the number of records in the source table)

·  Records Processed (the number of records that have been processed by the profiler)

-       Column.csv (column level profiling information)

·  Type (column path, connection type. Is needed to interpret and then correctly stitch the referenced column in MM. See the list of the possible values below.)

·  Server (column path, server name. Database host:port, Snowflake warehouse, file service, like gs://bucket)

·  Catalog (column path, catalog name. It can be a Snowflake or SQL Server database. Empty for Oracle and File systems.)

·  Schema (column path, schema name. It can be a database schema, file folder path, like /dir1/dir2 )

·  Table (column path, table name. It can also be a file, or partitioned directory name)

·  Column (column path, column name)

·  Creation Date (the date/time when data profiling information has been collected. See below the supported date time format)

·  Count (KPI with the count of all the values in the column. Long value)

·  Distinct (KPI with the counter for distinct values. Long value)

·  Duplicate (KPI with the counter for duplicate values. Long value)

·  Blank (KPI with the counter for blank string values. Long value)

·  Zero (KPI with the counter for zero numeric values. Long value)

·  Null (KPI with the counter for NULL values. Long value)

·  Min (KPI with the minimum value in the column. Usually applicable for the numeric values.)

·  Max (KPI with the maximum value in the column. Usually applicable for the numeric values.)

·  Mean (KPI with the mean value in the column. Usually applicable for the numeric values.)

·  Variance (KPI with the variance value in the column. Usually applicable for the numeric values.)

·  Median (KPI with the median value in the column. Usually applicable for the numeric values.)

·  Lower quantile (KPI with the lower quantile value in the column. Usually applicable for the numeric values.)

·  Upper quantile (KPI with the upper quantile value in the column. Usually applicable for the numeric values.)

·  Avg length (KPI with the average value length in the column. Usually applicable for the string values. Double value)

·  Min length (KPI with the minimum value length in the column. Usually applicable for the string values. Long value)

·  Max length (KPI with the maximum value length in the column. Usually applicable for the string values. Long value)

·  Valid (KPI with the counter for valid values. Usually computed based on the column’s data type. Long value)

·  Invalid (KPI with the counter for invalid values. Usually computed based on the column’s data type. Long value)

·  Data Type Invalid Values (Serialized list of the invalid values. Usually computed based on the column’s data type. Use character semicolon ‘;’ as a delimiter and backward slash ‘\’ as an escape character to serialize multiple values)

·  Data Type Frequency (Serialized list of data type frequencies in the format “[value],[count],[ratio]” (e.g. INTEGER,6,60.00;STRING,4,40.00). Use character semicolon ‘;’ as a delimiter and backward slash ‘\’ as an escape character to serialize multiple values)

·  Pattern Frequency (Serialized list of pattern frequencies in the format “[value],[count],[ratio]” (e.g. 999-99-9999,19972,1.00000). Use character semicolon ‘;’ as a delimiter and backward slash ‘\’ as an escape character to serialize multiple values)

·  Value Frequency (Serialized list of value frequencies in the format “[value],[count],[ratio]” (e.g. 987-66-3365,1,0.00005;987-66-5400,1,0.00005). Use character semicolon ‘;’ as a delimiter and backward slash ‘\’ as an escape character to serialize multiple values)

Sampling (folder with Data Sampling information, optional)

-       Data Sampling.csv (mapping file that allows to lookup Data Sampling information for the individual tables/files. Is used to resolve special characters that can be part of the table names but not allowed in the file system to name the files accordingly)

-       *[file path recreated from ‘Data Sampling.csv’] (e.g. "\JDBC\172.0.0.1\FinanceDWDimensional\dbo\Customer.csv")

-       Table Url.csv (file that allows to specify a hyperlink to the original DQ source for the specified table, optional)

·  Type (table path, connection type. Is needed to interpret and then correctly stitch the referenced table in MM. See the list of the possible values below.)

·  Server (table path, server name. Database host:port, Snowflake warehouse, file service, like gs://bucket)

·  Catalog (table path, catalog name. It can be a Snowflake or SQL Server database. Empty for Oracle and File systems.)

·  Schema (table path, schema name. It can be a database schema, file folder path, like /dir1/dir2 )

·  Table (table path, table name. It can also be a file, or partitioned directory name)

·  Url (a hyperlink on the corresponding table record in the original DQ tool)