Semi-Structured Data Pipelines: How to Mitigate Risk and Accelerate Innovation with Snowflake
Exploring the hidden challenges that create downstream risk, and how to identify and solve them with DataPancake™ from TDAA inside of Snowflake.
Written by Jon Farr, Founder and CEO of TDAA
Introduction
Creating and maintaining data pipelines for semi-structured data is one of the most expensive, time-consuming, and error-prone tasks that data teams face today. The process is fraught with risks, including inaccurate or incomplete data, which can lead to negative impacts on downstream systems.
Data teams face many challenges when tasked with accurately discovering the schema for a semi-structured data source and then producing the SQL required to flatten it, including:
- Polymorphic data: Handling evolving attribute structures and data types over time.
- Stringified and escaped JSON: Complex data objects encapsulated as strings, which can appear in any semi-structured format, including XML, Parquet, Avro, and ORC.
- Schema evolution & drift: Detecting schema changes, preserving historical context, and ensuring compatibility across data versions.
- Flattening hierarchical data: Ensuring path reference accuracy, handling foreign keys in nested structures, and processing deep object hierarchies.
These challenges must be overcome before teams can focus on higher-value activities such as data cleansing, enhancement, and enrichment. Yet for many companies, this is a constant point of friction due to time, people, and tool constraints. As these challenges add up, maintaining existing data pipelines and creating new ones becomes extremely difficult to scale, impacting the data team’s ability to support the needs of the business.
Today’s data management tools and integration platforms are not designed to comprehensively address the complex issues that engineers face when processing semi-structured data.
Every company working with this type of data understands that creating and maintaining these pipelines is often prohibitively expensive in time, cost, and resources. Snowflake provides functions like LATERAL FLATTEN, OBJECT_KEYS, and PARSE_JSON, but these require ongoing manual intervention to handle schema drift, polymorphic structures, and nested JSON complexities, resulting in fragile pipelines that break as data changes. As a result, many companies are unable to fully leverage this data, which can lead to operational inefficiencies, lost revenue, or increased opportunity costs.
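To make the friction concrete, here is a minimal hand-written flattening query of the kind these functions require; the table and attribute names are hypothetical. It works until the schema drifts: a renamed path or a changed data type returns NULLs rather than raising an error.

```sql
-- Hand-written flattening with hypothetical table and attribute names.
-- If "addresses" moves or changes type, these columns silently become NULL.
SELECT
  e.payload:customer.id::INT AS customer_id,
  a.value:street::STRING     AS street,
  a.value:city::STRING       AS city
FROM events AS e,
LATERAL FLATTEN(input => e.payload:customer.addresses) AS a;
```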
Due to these challenges, companies resort to creating pipelines that extract only the slices of data that meet their immediate needs. In doing so, they accumulate tremendous technical debt and increase the risk of downstream data quality issues, which can damage trust with the business, partners, and customers.
This article explores the root causes of these data challenges and demonstrates how DataPancake uniquely and effectively solves them to accelerate the availability of accurate data. Unlike manual approaches that require constant maintenance, DataPancake ensures complete schema discovery, accurate transformations, and long-term pipeline stability—eliminating hidden data loss and inefficiencies.
1. Polymorphic Data
⚠️ The polymorphic data problem
Most data engineers have experienced challenges when processing JSON. First, because JSON is schema-less, the data types used for individual attributes can change over time. An attribute named “address” can use a string data type to store a value like “123 Main St.” After a schema change, the same attribute can hold an array of objects. The objects stored in that array could be highly complex, containing other nested objects and additional arrays of multiple addresses and related data. The JSON format has no schema restrictions to prevent attribute data types from changing over time, and as these changes occur, polymorphic data is created.
Polymorphic data can break downstream processes that expect attributes of a particular data type. Worse, if the parsing and flattening process does not handle the data type change appropriately, some data can be lost before it ever reaches the downstream processes that rely on it.
With no exception handling to notify the data team, the lost data may go unnoticed until an internal consumer or customer discovers it.
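A small, hypothetical illustration of the problem in Snowflake: the same attribute is a string in one record and an object in another, and a query written for one form silently loses the other.

```sql
-- Hypothetical example: "address" is a plain string in one record
-- and a nested object in another (polymorphic data).
CREATE OR REPLACE TEMPORARY TABLE raw_events AS
  SELECT PARSE_JSON('{"id": 1, "address": "123 Main St."}') AS payload
  UNION ALL
  SELECT PARSE_JSON('{"id": 2, "address": {"street": "123 Main St.", "city": "Springfield"}}');

-- TYPEOF exposes the drift: VARCHAR for record 1, OBJECT for record 2.
SELECT payload:id::INT AS id, TYPEOF(payload:address) AS address_type
FROM raw_events;

-- A query that assumes the object form silently returns NULL for record 1.
SELECT payload:id::INT AS id, payload:address.street::STRING AS street
FROM raw_events;
```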
💡 Why it matters
Unnoticed polymorphic data can corrupt reports, ML models, and business insights. If your pipeline isn’t designed to detect evolving structures, you risk silent data loss, inaccurate aggregations, and misreported metrics—issues that can take weeks to detect and months to fix.
✅ How DataPancake solves these challenges
DataPancake solves these issues by using its innovative and patent pending process to discover the complete schema of a data source, even if the schema is highly complex and deeply nested. DataPancake accurately and quickly identifies all of the data types used for each attribute, ensuring complete schema discovery for accurate data pipelines.
When the data is flattened, each polymorphic version of every attribute will be extracted as its own derived column in the generated SQL. This provides clear visibility into all of the data and eliminates hidden inaccuracies.
2. Escaped/Stringified JSON
⚠️ The escaped or stringified JSON problem
The second major challenge associated with a semi-structured data format is that attributes defined as strings can contain JSON objects that have been stringified (converted into a plain text string representation for easier storage or transmission) or escaped (with escape characters inserted to ensure literal interpretation).
This situation isn’t limited to JSON data; any valid semi-structured data format such as XML, Parquet, Avro, ORC, or JSON can include string fields that hold a stringified or escaped complex JSON object with its own schema. If the schema of that stringified JSON is not documented, it must be discovered in full. Notably, Kafka message data streamed into Snowflake can be stored as an escaped JSON string.
The challenge is that each record in the data source can have one or more attributes containing a unique escaped JSON string. Further complicating matters, each escaped JSON string can itself contain nested strings that hold additional escaped or stringified JSON.
To create an accurate data pipeline, each attribute containing this type of JSON for each record would need to be discovered and flattened, a task that’s virtually impossible to do quickly.
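A small, hypothetical illustration of why this is hard: the inner schema is invisible to normal path references until the string is explicitly parsed a second time.

```sql
-- Hypothetical example: "details" is a string containing escaped JSON.
WITH raw AS (
  SELECT PARSE_JSON($${"order_id": 42, "details": "{\"sku\": \"A-100\", \"qty\": 2}"}$$) AS payload
)
SELECT
  TYPEOF(payload:details)                         AS details_type, -- VARCHAR, not OBJECT
  payload:details.sku::STRING                     AS sku_direct,   -- NULL: no path exists on a string
  PARSE_JSON(payload:details::STRING):sku::STRING AS sku_parsed    -- requires a second, explicit parse
FROM raw;
```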
💡 Why it matters
If the JSON data flowing through your pipelines contains stringified JSON, and your processing logic is not aware of it, then there is a high likelihood that you have downstream data quality issues that are currently impacting the accuracy of every process that relies on that data.
✅ How DataPancake solves these challenges
DataPancake’s patent-pending scan discovers all stringified or escaped JSON inside of any string attribute of any format. The stringified JSON discovery provided by DataPancake is also recursive, so if a stringified or escaped JSON object exists inside another stringified JSON object, it will be discovered and its schema will be identified and flattened.
The result is clearly visible arrays and attributes showing the entire schema, including stringified or escaped JSON.
3. Schema Evolution (a.k.a. Schema Drift)
⚠️ The schema evolution problem
As businesses evolve—whether due to market shifts, operational improvements, or technological advancement—their underlying data sources and structures inevitably change.
Organizations that have change management policies in place know how to handle changing schemas and any required associated data migrations. However, for various reasons, not all data is accurately migrated or updated after a schema change.
This legacy data creates challenges for organizations that must manage and process historical data alongside current data spanning multiple schemas.
Often, an organization’s data is moved into one or more repositories such as data lakes, data lakehouses, and data warehouses. As a result, each schema change to each data source must be handled separately, and accurately. Any missteps or errors in this change management process will result in downstream quality issues, impacting the data products that rely on the data.
💡 Why it matters
Data lakes, lakehouses, and warehouses have the enormous task of incorporating every version of a data source’s schema over time. There are several risks in maintaining semi-structured data pipelines when schema evolution or drift occurs. If the change management process for any upstream system does not include robust communication and documentation with all downstream consumers, data quality issues are guaranteed to happen.
Data quality issues resulting from schema evolution can take several forms: disruptions to existing production pipelines that interrupt the flow of data, missing data due to changed or additional paths, new stringified JSON data, or inaccurate data due to an attribute’s changed data type. Any of these can have a significant negative impact on downstream data products, including reporting, analytics, machine learning, and GenAI.
✅ How DataPancake solves these challenges
DataPancake solves the issue of schema evolution and drift by scanning the dataset for a given data source to automatically detect and record any schema change. This detection applies to all historical data as well as any newly produced data. Users can configure how often their data is scanned using a custom CRON schedule. If needed, DataPancake can be further configured to automatically regenerate the SQL required to flatten semi-structured data due to any schema changes.
The DataPancake scanning feature is available in the Snowflake Marketplace at no cost, and the Snowflake compute cost for scanning this data is minimal, especially considering the value provided.
One million rows in a Snowflake table with a single VARIANT column containing 1 MB documents can be scanned for just a few dollars of Snowflake compute cost.
DataPancake also provides the appropriate metadata after each completed scan. This enables customers to create Snowflake alerts that notify the appropriate personnel when schemas have changed or when new polymorphic versions of an existing attribute have been added.
Customers can configure DataPancake to scan data in lower environments as well. This allows customers to know of potential schema changes before their production environment is ever affected.
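For teams that want the notification wired up in SQL, a minimal Snowflake alert might look like the sketch below. The metadata view, columns, warehouse, and email integration are hypothetical placeholders rather than DataPancake’s actual object names.

```sql
-- A sketch of alerting on scan metadata. All object names below are
-- hypothetical placeholders, not DataPancake's actual metadata objects.
CREATE OR REPLACE ALERT schema_drift_alert
  WAREHOUSE = alerts_wh
  SCHEDULE  = '60 MINUTE'
  IF (EXISTS (
    SELECT 1
    FROM datapancake_scan_results                 -- hypothetical scan metadata view
    WHERE new_attribute_count > 0
      AND scan_completed_at > DATEADD('hour', -1, CURRENT_TIMESTAMP())
  ))
  THEN CALL SYSTEM$SEND_EMAIL(
    'data_team_email_int',                        -- hypothetical notification integration
    'data-team@example.com',
    'Schema drift detected',
    'The latest scan discovered new attributes or polymorphic versions.'
  );

ALTER ALERT schema_drift_alert RESUME;            -- alerts are created suspended
```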
4. Flattening Hierarchical Data
There are several risks associated with flattening hierarchical data such as JSON:
〰️
⚠️ The path referencing problem
If an incorrect path is mistakenly referenced when flattening any type of semi-structured data in Snowflake, an exception will not be generated during SQL execution. Instead, the path reference error silently returns a null value.
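A quick, hypothetical illustration of that silence:

```sql
-- A typo in the path ("adress") raises no error; it simply returns NULL.
WITH raw AS (
  SELECT PARSE_JSON('{"address": {"city": "Springfield"}}') AS payload
)
SELECT
  payload:address.city::STRING AS city,      -- "Springfield"
  payload:adress.city::STRING  AS city_typo  -- NULL, with no exception raised
FROM raw;
```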
💡 Why it matters
When these silent errors occur, there is no way of knowing whether the data is actually null or whether a path reference error has occurred. These types of errors can lead to significant data quality issues and wasted troubleshooting efforts. Once again, these silent errors often remain invisible until a critical system fails or inaccurate metrics make their way into business decisions, damaging trust and outcomes.
✅ How DataPancake solves these challenges
DataPancake generates 100% accurate SQL for creating dynamic tables, which ensures path reference accuracy and eliminates silent parsing and flattening errors.
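For reference, the generated SQL targets standard Snowflake dynamic tables. A generic example of that shape, with hypothetical names rather than DataPancake’s exact output, looks like this:

```sql
-- A generic dynamic table built over a VARIANT column (hypothetical names).
CREATE OR REPLACE DYNAMIC TABLE orders_flat
  TARGET_LAG = '20 minutes'
  WAREHOUSE  = transform_wh
AS
SELECT
  payload:order_id::INT         AS order_id,
  payload:customer.name::STRING AS customer_name
FROM raw_orders;
```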
〰️️
⚠️ The foreign key problem
Correctly identifying and inserting foreign key values into nested arrays as they are flattened can be challenging. While it is straightforward to store relational data within tables of a relational database, formats such as JSON are not designed for this purpose. Specifically, a nested array, whether of objects or primitives, will likely not contain a value that relates it to the parent object it is nested under. As a result, foreign key values need to be inserted so that the resulting database objects can be joined together after flattening.
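A minimal sketch of the manual fix, with hypothetical attribute names: the parent’s key must be carried into every flattened child row so the two result sets can be joined later.

```sql
-- Propagating the parent's order_id into each flattened "items" row.
WITH raw AS (
  SELECT PARSE_JSON('{"order_id": 42, "items": [{"sku": "A-100"}, {"sku": "B-200"}]}') AS payload
)
SELECT
  raw.payload:order_id::INT AS order_id,  -- carried down as the foreign key
  item.value:sku::STRING    AS sku
FROM raw,
LATERAL FLATTEN(input => raw.payload:items) AS item;
```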
💡 Why it matters
Many times a surrogate key will not exist; instead, one or more attributes in the parent object together form a compound key. If the foreign keys are not inserted correctly, it will be impossible to join the flattened database objects together, making it difficult to make full use of the data.
✅ How DataPancake solves these challenges
DataPancake enables the configuration of one or more foreign keys for all nested arrays. This allows for an accurate relational design enabling the joining of all flattened database objects for downstream consumers.
〰️
⚠️ The nested object and array depth problem
JSON and other hierarchical data formats place no limit on how deeply arrays and objects can be nested. Without proper discovery, it is easy to miss entire objects, which can result in data loss. Because the schema of a JSON object can change from one row to another, it is critical that each JSON object’s schema be discovered independently across the entire data source.
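As a quick illustration, Snowflake’s FLATTEN only walks a single level unless RECURSIVE is set, which is one reason deeply nested attributes are easy to miss in hand-written SQL:

```sql
-- FLATTEN walks one level by default; RECURSIVE => TRUE exposes every level.
WITH raw AS (
  SELECT PARSE_JSON('{"a": {"b": {"c": [1, 2]}}}') AS payload
)
SELECT f.path, f.value
FROM raw,
LATERAL FLATTEN(input => raw.payload, RECURSIVE => TRUE) AS f;
```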
💡 Why it matters
If every nested object and array is not identified, data loss is possible, leading to substantial data quality issues.
✅ How DataPancake solves these challenges
DataPancake recursively scans the entire document in each record to discover every possible attribute regardless of how deeply nested it is.
〰️
⚠️ The destination data type accuracy problem
It is possible for a single attribute to store date, time, or timestamp values in multiple formats. If this data is not parsed, flattened, and transformed into the correct destination data type, then data loss will occur resulting in additional data quality issues.
💡 Why it matters
An incorrect format can lead to misinterpreted data. For example, if time zone information is lost, or the scale portion of a decimal value is dropped due to a conversion to an integer, the data becomes inaccurate, leading to additional and serious data quality issues.
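The manual workaround is typically a chain of format-specific attempts, as in this small sketch: TRY_TO_TIMESTAMP returns NULL on a mismatch, and COALESCE keeps whichever parse succeeds.

```sql
-- One attribute, two timestamp formats: a single format string would
-- silently drop whichever rows don't match it.
WITH raw AS (
  SELECT '2024-03-01 10:15:00' AS ts
  UNION ALL
  SELECT '03/01/2024 10:15:00' AS ts
)
SELECT
  ts,
  COALESCE(
    TRY_TO_TIMESTAMP(ts, 'YYYY-MM-DD HH24:MI:SS'),
    TRY_TO_TIMESTAMP(ts, 'MM/DD/YYYY HH24:MI:SS')
  ) AS ts_parsed
FROM raw;
```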
✅ How DataPancake solves these challenges
DataPancake infers the correct destination data type and, if applicable, the correct datetime format, and supports multiple datetime formats per attribute.
It also detects the difference between integer and decimal data types and even supports an attribute containing both at the same time.
5. Unlock Immediate Value with DataPancake
Simplifying the medallion architecture for semi-structured data pipelines
DataPancake enables each polymorphic version of an attribute to be transformed during the flattening process as part of the dynamic table. It also allows additional transformations to be configured in the data governance semantic view layer, where polymorphic versions of an attribute can be consolidated (for example, combining an int, float, and string into one column) and custom UDFs or external functions can be called.
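One way to picture that consolidation step is a simple downstream view, shown here with hypothetical table and column names, that merges the int, float, and string versions of an attribute into a single typed column:

```sql
-- Consolidating polymorphic versions of "amount" into one numeric column.
CREATE OR REPLACE VIEW orders_semantic AS
SELECT
  order_id,
  COALESCE(
    amount_int::NUMBER(38, 2),
    amount_float::NUMBER(38, 2),
    TRY_TO_NUMBER(amount_string, 38, 2)
  ) AS amount
FROM orders_flat;
```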
Using DataPancake to validate existing semi-structured data pipelines
DataPancake can be used today to quickly and inexpensively scan and discover the complete schema of an entire data source, providing validation of existing semi-structured data pipelines. DataPancake’s free product tier provides summary schema statistics for selected data sources and allows customers to compare their currently known schemas to the actual schemas discovered. It provides users with the following information for each scanned data source:
- Total # of Attributes
- # of Schema Originated Attributes (if comparing to a JSON or XSD schema)
- # of Discovered Attributes
- # of Polymorphic Attributes
- # of Arrays
- # of Objects
- # of Attributes that contain Stringified/Escaped JSON
- Max Level of Depth
If there is a discrepancy between what DataPancake discovers and what the current data pipeline is processing, users can upgrade to the Data Analysis product tier to gain access to the schema details and very quickly determine the exact differences between the two.
If users want to locate the specific data that is missing from their current data pipeline, they can upgrade to the Data Source product tier to generate the CREATE SQL statements for dynamic tables and views. Once deployed, users can query those dynamic tables and views to locate the specific data in question.
Take the step to resolve your data quality issues
DataPancake eliminates the risks associated with managing semi-structured data pipelines and flattening hierarchical data. It will transform the process of creating and maintaining data pipelines for semi-structured data, ensuring data accuracy, quality, and consistency.
🔸 Try DataPancake (a Snowflake Native App) for free in your Snowflake environment today
📩 Questions? Reach us at contact@datapancake.com