Python + Batch Script for Data Dictionary (.XLSX) Validation with a Postgres Data Warehouse

Are you facing challenges with discrepancies and validation in your analytics team? Consider these situations:

  • Are manual validation methods introducing errors and inefficiencies?
  • Does simultaneous ETL development lead to ETL code regressions impacting your data warehouse?
  • Is there confusion validating tables and fields amidst ongoing development and feature additions?
  • Are unclear guidelines causing inconsistencies in documentation?

If you are facing any of these issues, this article is for you. They are among the most common problems in analytics teams.

What is the purpose of this program?

Technically: the program reads tables and fields from an Excel file, compares them with a Postgres database, and generates a validation report showing which tables and fields are missing or do not exist.

This program simplifies analytics challenges by offering a user-friendly interface for easy schema selection. It generates clear .xlsx reports that let you quickly identify existing and missing tables and fields: you can copy and paste a data dictionary into .xlsx format, and the program detects the tables and fields and compares them with your data warehouse. By comparing data dictionaries against your data warehouse, it strengthens data governance and helps you promptly resolve underlying issues in your data warehouse, ETL process, or documentation.
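Under the hood, the comparison boils down to lookups against Postgres's information_schema catalog, which the script (shown in full below) runs for every table and field it finds in the Excel file. As a minimal sketch, with an illustrative schema and view name, the table-existence check is equivalent to:

SELECT EXISTS (
    SELECT 1
    FROM information_schema.tables
    WHERE table_schema = 'HumRes'
      AND table_name = 'view_HumRes_Articles'
);

The same pattern against information_schema.columns, with an extra column_name filter, answers whether an individual field exists.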


For simplicity, let's assume your analytics team uses the following tools.

Data Warehouse: PostgreSQL

ETL Tool: Talend for Data Integration

Visualization and Reporting: Tableau

Documentation/Data Catalog: Confluence

My articles will be based on these tools most of the time, and they are fairly simple to understand. Please note, the script developed below uses psycopg2, the PostgreSQL library for Python.

In this article I am going to demonstrate, step by step, how to develop a batch script that automates your data validation and streamlines your analytics workflow.

Preparation:

Download this file from GitHub: https://github.com/kwagle7/Python-Bash-Script-for-Data-Dictionary-.XLSX-validation-with-Postgres-Data-Warehouse

Put all files and programs under C:\test; the folder should contain the two batch files, script_name.py, and your .xlsx data dictionary.

I am assuming your tables are views with the prefix view_ (e.g. view_HumRes_Articles) and that they are documented in a platform such as Confluence, from which you can copy your data dictionary into an Excel file. The only requirement is that the table names and fields are in the first column, as that is how this program is designed (see the example layout in Step 4).

Quick Steps

💡 Step 1:
Install Python and the required libraries by running Install_python_and_libraries.bat

💡 Step 2:
Run the main batch file, run_python_script.bat, to execute the Python script.

💡 Step 3:
Enter the password for your database.

💡 Step 4:
Enter the name of the schema you want to validate against.

💡 Step 5:
Save the report in your desired location (give the report a name, e.g. HumRes_report).

It will save the report in .xlsx format.

Note: When executing the process a second time, skip the first step.

Detailed Steps, and What's Inside Them

Step 1: Install Python and Libraries

Program Name: Install_python_and_libraries.bat

Required libraries: psycopg2 (PostgreSQL adapter) and openpyxl (.xlsx reader/writer)

This program installs Python and the required libraries, then sets the environment variables.

@echo off
setlocal enabledelayedexpansion

REM Define variables
set "python_version=3.12.2"
set "install_path=C:\Python312"
set "libs=openpyxl psycopg2"

REM Install Python
echo Installing Python %python_version%...
curl -o python_installer.exe https://www.python.org/ftp/python/%python_version%/python-%python_version%-amd64.exe
start /wait python_installer.exe /quiet InstallAllUsers=1 PrependPath=1 TargetDir=%install_path%
del python_installer.exe

REM Install Python libraries
echo Installing Python libraries...
%install_path%\Scripts\pip install %libs%

REM Set environment variables (append rather than overwrite the existing PATH)
echo Setting up environment variables...
setx PATH "%PATH%;%install_path%;%install_path%\Scripts" /M

echo Installation complete.
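To confirm the installation succeeded (an optional sanity check, not part of the batch file), open a new Command Prompt and run:

python --version
pip show openpyxl psycopg2

Both commands should report the installed versions. A new prompt is needed because PATH changes made by setx do not affect windows that are already open.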

Step 2: Run the Batch File

Program Name: run_python_script.bat

This program runs the Python script. Note that you can create a test folder, C:\test, and put all programs and your .xlsx data dictionary in the same folder.

@echo off

REM Get the current directory of the script
set "script_folder=%~dp0"

REM Change the directory to the script's folder
cd /d "%script_folder%"

REM Run the Python script
python script_name.py

pause

The batch program above runs the following Python script:

script_name.py

import openpyxl
import psycopg2
from psycopg2 import sql
from tkinter import filedialog
from tkinter import Tk
from datetime import datetime
import getpass  # Prompts for the password without echoing it

# PostgreSQL connection details
db_server = "enter your server" #e.g. yourwarehouse.yourdomain.com
db_port = 1234 # replace 1234 with the port number
db_name = "Enter database name" # exampleDataWarehouse
db_user = "enter username" #e.g. postgres1
db_password = getpass.getpass("Enter your database password: ")  # Prompt user for password

# Function to check table existence in a specific schema
def table_exists_in_schema(cursor, schema_name, table_name):
    # Plain %s placeholders are filled safely by execute(); no identifier formatting is needed
    query = "SELECT EXISTS (SELECT 1 FROM information_schema.tables WHERE table_schema = %s AND table_name = %s)"
    cursor.execute(query, (schema_name, table_name))
    return cursor.fetchone()[0]

# Function to check field existence in a table in a specific schema
def field_exists_in_schema(cursor, schema_name, table_name, field_name):
    query = "SELECT EXISTS (SELECT 1 FROM information_schema.columns WHERE table_schema = %s AND table_name = %s AND column_name = %s)"
    cursor.execute(query, (schema_name, table_name, field_name))
    return cursor.fetchone()[0]

# Custom function for case-insensitive schema name conversion
def convert_to_valid_schema(user_input):
    valid_schemas = [ "HumRes", "Finance", "Production", "Marketing", "Test"]
    lowercased_input = user_input.lower()
    
    for schema in valid_schemas:
        if lowercased_input == schema.lower():
            return schema

    print("Invalid schema name. Please enter a valid schema name.")
    exit()

# Connect to PostgreSQL
connection = None  # defined up front so the finally block can reference it even if connect fails
try:
    # Prompt user for the schema and convert to valid format
    user_schema = convert_to_valid_schema(input("Enter the table schema e.g Finance, HumRes: "))

    # Open file dialog to select Excel file
    root = Tk()
    root.withdraw()  # Hide the main window
    xlsx_file_path = filedialog.askopenfilename(title="Select Excel file", filetypes=[("Excel files", "*.xlsx")])
    if not xlsx_file_path:
        print("No file selected. Exiting.")
        exit()

    connection = psycopg2.connect(
        host=db_server,
        port=db_port,
        database=db_name,
        user=db_user,
        password=db_password
    )

    with connection.cursor() as cursor:
        views = []
        missing_tables = set()

        # Set to store unique tables and fields encountered in the Excel file
        excel_tables_and_fields = set()

        # Read Excel file and check table and field existence
        workbook = openpyxl.load_workbook(xlsx_file_path)

        for sheet_name in workbook.sheetnames:
            sheet = workbook[sheet_name]

            table_name = None
            skip_table_description = False

            for row in sheet.iter_rows(min_row=2, values_only=True):
                cell_value = str(row[0]).strip() if row[0] else None  # str() guards against non-text cells

                if cell_value and cell_value.startswith("view_"):
                    table_name = cell_value
                    skip_table_description = True
                    continue

                if table_name and cell_value == "Field":
                    skip_table_description = False
                    continue

                if skip_table_description:
                    continue

                if table_name and cell_value:
                    field_name = cell_value.strip()
                    if table_name.startswith("view_"):
                        category_list = views
                        excel_tables_and_fields.add((table_name, field_name))
                    else:
                        continue

                    table_existence = table_exists_in_schema(cursor, user_schema, table_name)
                    category_list.append((table_name, table_existence, field_name,
                                          field_exists_in_schema(cursor, user_schema, table_name, field_name) if field_name else None))

        # Check for missing tables and fields in the data warehouse
        cursor.execute(sql.SQL("SELECT table_name, column_name FROM information_schema.columns WHERE table_schema = %s"), (user_schema,))
        warehouse_tables_and_fields = set(cursor.fetchall())

        missing_tables_and_fields = warehouse_tables_and_fields - excel_tables_and_fields

        # Print results based on category
        print("\nResults:")
        print("\033[94mViews:\033[0m")

        # Print existing views in alphabetical order
        views.sort(key=lambda x: x[0])
        for table, exists, field, field_existence in views:
            print(f"{table}: {'\033[92mExists\033[0m' if exists else '\033[91mDoes not exist\033[0m'}")
            if field:
                print(f"  - {field}: {'\033[92mExists\033[0m' if field_existence else '\033[91mDoes not exist\033[0m'}")

        # Print missing tables and fields for views in alphabetical order
        missing_tables_and_fields = sorted(
            [item for item in missing_tables_and_fields if item[0].startswith("view_")],
            key=lambda x: (x[0], x[1])
        )
        print("\n\033[94mMissing Tables and Fields:\033[0m")

        current_table = None  # To keep track of the current table being printed
        for table_name, field_name in missing_tables_and_fields:
            if table_name != current_table:
                if current_table:
                    print()  # Add a newline between different tables
                current_table = table_name

            if field_name:
                print(f"{table_name} - {field_name}: \033[91mMissing in Excel file\033[0m")
            else:
                print(f"{table_name}: \033[91mMissing in Excel file\033[0m")

        # Create a new Excel workbook
        output_workbook = openpyxl.Workbook()

        # Remove the default "Sheet" created by openpyxl
        default_sheet = output_workbook["Sheet"]
        output_workbook.remove(default_sheet)

        # Create sheets in the workbook
        views_sheet = output_workbook.create_sheet("Data Dictionary")
        missing_tables_sheet = output_workbook.create_sheet("Data Warehouse")

        # Add header rows to the sheets
        views_sheet.append(["Table", "Table Existence in DWH", "Field", "Field Existence in DWH"])
        missing_tables_sheet.append(["Table", "Field", "Status"])

        # Write results to the 'Data Dictionary' sheet
        for table, exists, field, field_existence in views:
            views_sheet.append([table, 'Exists' if exists else 'Does not exist', field, ('Exists' if field_existence else 'Does not exist') if field else ''])

        # Write missing tables and fields to the 'Data Warehouse' sheet
        for table_name, field_name in missing_tables_and_fields:
            missing_tables_sheet.append([table_name, field_name, 'Missing in Excel file'])

        # Prompt user to choose where to save the .xlsx file
        output_filename = filedialog.asksaveasfilename(defaultextension=".xlsx", filetypes=[("Excel files", "*.xlsx")])

        # Save the output Excel file (skip if the user cancelled the dialog)
        if output_filename:
            output_workbook.save(output_filename)
            print(f"\nResults have been saved to '{output_filename}'.")

except psycopg2.Error as e:
    print(f"\033[91mError: {e}\033[0m")

finally:
    if connection:
        connection.close()
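For a sense of what the console output looks like, here is a hypothetical run (the view and field names are made up; each table line repeats per field because the script prints one table/field pair per row, and the \033 codes render as green/red in ANSI-capable terminals):

Results:
Views:
view_HumRes_Articles: Exists
  - article_id: Exists
view_HumRes_Articles: Exists
  - legacy_code: Does not exist

Missing Tables and Fields:
view_HumRes_Salaries - salary_id: Missing in Excel file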

Step 3: Enter the Schema

You are required to enter the schema name to compare the data dictionary with the data warehouse. For example, to validate the Human Resources data dictionary, enter the schema name 'HumRes'. Note that the accepted names are defined in the valid_schemas list inside script_name.py, so adjust that list to match your warehouse. If you are unsure which schemas exist, you can list them directly in Postgres, as shown below.
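The following generic Postgres query (run in psql or any SQL client; it is not part of the script) lists the schemas in your database:

SELECT schema_name
FROM information_schema.schemata
ORDER BY schema_name;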

Step 4: Select Data Dictionary

In the data dictionary within the XLSX file, the tables and fields must be placed in the first column for this program to function correctly.

How to create it?

You can copy and paste your data dictionary (e.g. from Confluence) into an Excel sheet (.xlsx) and name the file after the associated data dictionary module.

For example, I have "HumRes_Data_Dictionary", structured as follows.
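As a hypothetical first-column layout (the names are illustrative; only the view_ prefix, the skipped description rows, and the "Field" marker matter to the script), the sheet would contain:

view_HumRes_Articles
Contains one row per published article   (description rows are skipped)
Field
article_id
article_title
created_date

Keep in mind that the script starts reading at spreadsheet row 2 (min_row=2), so row 1 can hold a heading.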

Note: In this structure, tables and fields are in the first column. The program disregards table descriptions and headers but reads the table name and its fields.

After selecting the file, the results are printed to the console and a Save As window appears.

Enter a name for the report (e.g. HumRes_report) and select the folder where you want to save it. The report is then saved in .xlsx format with multiple sheets.

The file is saved in the location you selected.

Step 5: View the report

The report will have two sheets:

  1. Data Dictionary: reports whether each table and field in the data dictionary exists in the data warehouse (compares the data dictionary against the data warehouse).
  2. Data Warehouse: reports tables and fields that exist in the data warehouse but are missing from the data dictionary (i.e. compares the data warehouse against the data dictionary). A sample of both sheets is sketched below.
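For illustration, with hypothetical names (the column headers come straight from the script), the two sheets might look like:

Data Dictionary sheet:
Table                | Table Existence in DWH | Field       | Field Existence in DWH
view_HumRes_Articles | Exists                 | article_id  | Exists
view_HumRes_Articles | Exists                 | legacy_code | Does not exist

Data Warehouse sheet:
Table                | Field     | Status
view_HumRes_Salaries | salary_id | Missing in Excel file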

Conclusion:

In summary, automated data dictionary validation streamlines and improves the validation process, empowering team members to use the program effectively. It addresses challenges such as manual validation, ETL code regressions, ambiguity in validating tables and fields, and inconsistent documentation practices. The user-friendly interface and detailed reports enhance efficiency and data governance, providing clear insights for team members and ensuring the reliability of the validation system.

