As you progress in your data journey, whether you’re transitioning from data analyst to data engineer or simply building your data skills, online certifications such as IBM Data Warehouse Engineer or Azure Data Engineer Associate will expose you to a wide array of data topics. You may not need to master every detail, but staying familiar with these subjects will strengthen your expertise.
Note: I’ll be adding more relevant topics as I learn along the way.
Topics:
| Topics | Link | Comments | Tags |
|---|---|---|---|
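| Schema on read/write | Read/Write Schema | Schema on write: traditional RDBMS (structure is enforced when data is loaded). Schema on read: data lake (structure is applied when data is queried). | Data |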
| GDPR | General Data Protection Regulation | Regulation on information privacy in the European Union and the European Economic Area. | Privacy, Security, Governance |
| HIPAA | Health Insurance Portability and Accountability Act | US law that protects patient health information. The Privacy Act 1988 is largely the Australian counterpart to HIPAA. | Privacy, Security, Governance |
| PCI DSS | Payment Card Industry Data Security Standard (PCI DSS) | Global standard mandated by the leading Card Schemes including Visa and MasterCard to reduce the risk of card data breach. | Privacy, Security, Governance |
| Performance Tuning in Talend | Talend Performance Tuning | Identify and eliminate bottlenecks | Talend, ETL |
| Slowly Changing Dimension (SCD) | Slowly Changing Dimensions (SCDs) | Type 1: overwrite the old value (no history). Type 2: add history as a new row. Type 3: add history as a new column. Note: a Type 2 load should be idempotent, so re-running it with the same input must not add duplicate history rows. | Data |
| Change Data Capture (CDC) | Change Data Capture (CDC) | Capturing changes (inserts, updates, deletes) in a source system and extracting them in real time or near real time | Data |
| Idempotent Data Pipelines | Idempotent Data Pipelines | The ability to execute the same operation multiple times without changing the result. Not idempotent: INSERT INTO without TRUNCATE, which may store the same data twice. | Data |
| Analytics Engineering | Analytics Engineering | | Data |
| Serverless Computing | Serverless vs Containers | Cloud provider manages the infrastructure and automatically allocates computing resources as needed to run applications. Examples: Azure Functions, AWS Lambda | Computing |
| Container | Containers vs VMs | Virtual machines provide an abstracted version of the entire hardware of a physical machine, including the CPU, memory, and storage. Containers are portable packages of software and its dependencies that run on a physical or virtual machine. | Computing |
| Virtual Machines | Containers vs VMs | Virtual machines provide an abstracted version of the entire hardware of a physical machine, including the CPU, memory, and storage. Containers are portable packages of software and its dependencies that run on a physical or virtual machine. | Computing |
| Agnostic Storage | | Data agnostic: a device or program that can receive data in multiple formats or from multiple sources and still process that data effectively. Hardware agnostic: software that can work with a variety of hardware components without being limited to a specific set of configuration options. This can make data storage more flexible and protect data in case of hardware failure. | |
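
To make the idempotency row above concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `sales` table, its columns, and the function names are made up for illustration. SQLite has no `TRUNCATE`, so the sketch assumes a `DELETE` without a `WHERE` clause as the stand-in.

```python
import sqlite3


def load_append(conn, rows):
    """Non-idempotent load: every run appends the same batch again."""
    conn.executemany("INSERT INTO sales (id, amount) VALUES (?, ?)", rows)
    conn.commit()


def load_replace(conn, rows):
    """Idempotent load: clear the target, then insert, in one transaction."""
    with conn:  # commits on success, rolls back on error
        # SQLite has no TRUNCATE; DELETE without WHERE empties the table.
        conn.execute("DELETE FROM sales")
        conn.executemany("INSERT INTO sales (id, amount) VALUES (?, ?)", rows)


def row_count(conn):
    return conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]


def demo():
    """Run each load twice and report the resulting row counts."""
    batch = [(1, 10.0), (2, 25.5)]  # hypothetical sample batch
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (id INTEGER, amount REAL)")

    load_append(conn, batch)
    load_append(conn, batch)
    appended = row_count(conn)  # the batch is now stored twice

    load_replace(conn, batch)
    load_replace(conn, batch)
    replaced = row_count(conn)  # the batch is stored once, however often we rerun

    conn.close()
    return appended, replaced


if __name__ == "__main__":
    print(demo())
```

Full replacement is only one way to get idempotency. On large tables, an upsert keyed on a unique ID or a delete-and-reload of a single partition are common alternatives.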
Tools:
| Tools | Link | Comments |
|---|---|---|
| Apache Parquet | Apache Parquet | Column-oriented data file format |
| Docker | Docker | Tool for building and running containers. Remember: the difference between Docker and VMware mirrors the difference between containers and virtual machines. Kubernetes is a container orchestration system for automating software deployment, scaling, and management. |
| MariaDB | MariaDB | Fork of MySQL with advanced features |
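
To illustrate what "column-oriented" means in the Parquet row above, here is a small pure-Python sketch. It is not Parquet's actual on-disk format and uses no Parquet library; the sample rows and function names are hypothetical. It only shows the layout idea: values of the same column are stored together, so a query that needs one column does not have to touch the others.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "city": "Sydney", "amount": 10.0},
    {"id": 2, "city": "Perth", "amount": 25.5},
    {"id": 3, "city": "Sydney", "amount": 7.25},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_amount_row_oriented(records):
    """Walks every full record just to read a single field."""
    return sum(record["amount"] for record in records)


def total_amount_column_oriented(columns):
    """Reads only the 'amount' column and ignores the rest."""
    return sum(columns["amount"])


if __name__ == "__main__":
    cols = to_columns(rows)
    print(cols["amount"])
    print(total_amount_row_oriented(rows), total_amount_column_oriented(cols))
```

Both functions return the same total. The difference is how much data each one has to read, which is why analytical queries over a few columns tend to favour columnar formats.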
