
Understanding dimensions/reference data, late arrivals, and out of order events in Real Time Intelligence

2025-09-16
Tags: insights, architecture-strategy, archive

One of the interesting things I find about event driven architectures in Microsoft Fabric is how with the convergence of worlds, different terminologies that meant the same thing have now come to light. For example, take dimensions in classic data warehousing scenarios. In streaming data scenarios, we call this contextualization, which essentially means to apply reference data to the stream of events coming in. Since events are really just time series facts that are generated, some of the classic challenges that we have dealt with for a long time can be handled in new, more efficient ways. This week's article dives deep into dimensions in classic data warehouse scenarios and how to handle these with real time intelligence in Microsoft Fabric.

For this article, I'm going to focus on type 1 and type 2 dimensions. Although there are other dimension types, these address the most common dimensional use cases. For a quick refresher, type 1 dimensions overwrite records with updates, and type 2 dimensions store history via begin/end effective dates (or Start/End Effective; ValidFrom/To; whatever you call them). For the skilled BI practitioner, this is not only part and parcel of the day to day, but it also comes with a particular set of requirements from the business. I'm not going to restate it all here as it has been covered many times before, but creating both type 1 and type 2 dimensions requires updates of existing rows via either simple column to column comparison or hash key generation. Whether you use Spark, SQL, or some other tool, the patterns and practices are the same.

Type 1 and Type 2 Dimensions

In event driven architectures, particularly in Real Time Intelligence, these constructs are not referred to as "dimensions". They are called reference data, context data, or in some scenarios, "tagging" the data. Regardless of which terminology is used, the core goal remains the same: providing descriptive definitions around the stream of event-based data. How these descriptive definitions get applied is where we can start to leverage the power of Eventhouse in Real Time Intelligence.

Did you know that Eventhouse in Fabric adds an ingestion time to every row that is inserted into the database by default? This is very useful, especially for tracking history. For starters, this makes us begin to question certain dimension specific requirements. Why do we need a start and end history effective date if the database by default tracks this? Snapshotting the dimension record daily gives us built in type 2 historical attribute reporting, so why do we need to build complex logic in ELT pipelines to handle this? It is far more efficient on a database engine to do a straight insert. For those coming from the data warehousing world, think of Eventhouse as a built in engine that creates a snapshot fact table on every table every time it's loaded.
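As a quick illustration of this idea (a sketch only; the table and column names here are hypothetical), you can surface the hidden ingestion timestamp and pull the attribute values that were current in a given day's snapshot:

```kql
// Each daily snapshot load of the dimension is a plain insert.
// ingestion_time() exposes the hidden timestamp Eventhouse stamps on every row.
Product
| extend IngestedAt = ingestion_time()
| where IngestedAt between (startofday(datetime(2025-09-07)) .. endofday(datetime(2025-09-07)))
| summarize arg_max(IngestedAt, *) by ProductId   // latest version within that day's snapshot
```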

For "type 1" dimensions, this also holds true. If you are not interested in the history of the data, you can simply delete the older loads. Anything older than today's snapshot simply gets deleted. Depending on how often you load this data, simply set the retention policy of Eventhouse to that duration. For example, if you are snapshotting a table from SQL as a dimension every 15 minutes and loading it into Eventhouse, setting the retention policy to 15 minutes simply deletes the previous runs of data. (Depending on your tolerance, you may choose to keep the previous 2 or 3 snapshots and simply aggregate to the latest value when querying.) The flexibility of the engine, however, shows us that regardless of the approach chosen, it's easily handled and greatly simplified.
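A retention policy like this is a single control command (a sketch; the table name is hypothetical, and the soft-delete period should match your load cadence):

```kql
// Keep only ~15 minutes of snapshots; older extents are soft-deleted automatically
.alter-merge table ProductSnapshot policy retention softdelete = 15m
```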

Now, you might be wondering: if this is the case, won't the database just fall over if you snapshot every dimension every day? When applying type 1 logic, there is no need to keep history at all. Just grab the latest value each day and set your retention policy. When applying type 2 logic, you are leveraging the power of the engine again. Remember that Eventhouses use extents, not traditional SQL and columnstore constructs, and this leads to differences in implementation patterns. Remember too that within Eventhouse, not all data has to be kept hot. The charge for storage in the Eventhouse cold cache (the delta between the caching policy and the retention policy) is exactly the same as the storage cost for OneLake.

You may choose to keep the last 30 or 60 days of "type 2" dimension data hot, with the rest of the table in the cold cache for the occasional lookup of a record as it was when it was active. I would argue this is simpler than traditional type 2 implementations. Trying to write a query to "find all of the product records that were active on a specific day in history" is not easy in SQL, because you have to search between the start and end dates. In Eventhouse, simply filter on ingestion_time() for the date you want to see history for. Much simpler, and very performant (you are only querying a few extents, because extents by default are partitioned on ingestion time).
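The hot window described above is set with a caching policy (again a sketch; the table name is hypothetical):

```kql
// Keep the most recent 30 days of extents in hot cache; older extents stay in
// cheap cold storage but remain fully queryable
.alter table ProductSnapshot policy caching hot = 30d
```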

Loading Reference Data into Eventhouse

Within Fabric, there are a multitude of ways to load reference data into Eventhouse:

  • Pipelines
  • Shortcuts
  • Eventstreams (yes, even reference data can be streamed!)
  • Notebooks

Because reference data typically moves at a much slower pace than pure operational data, it can also be loaded at a slower cadence. This could range from real time, if the needs dictate it, to daily batch loads. However, loading this data into Eventhouse allows us to create these joins and contextualization either at query time or through an update policy. I'm not going to walk through every example and every scenario, but you can get lots of information on this in the article Medallion Architecture for Real Time Intelligence.
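As a sketch of the update-policy route (all table and function names here are hypothetical), an enriched table can be populated automatically as raw events land, applying the reference data at ingestion time:

```kql
// Function the policy runs against each new batch landing in RawEvents
.create-or-alter function EnrichEvents() {
    RawEvents
    | lookup kind=leftouter (ProductReference) on ProductId
}

// Attach the update policy: every insert into RawEvents also populates EnrichedEvents
.alter table EnrichedEvents policy update
@'[{"IsEnabled": true, "Source": "RawEvents", "Query": "EnrichEvents()", "IsTransactional": false}]'
```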

The crux is that we've simplified previously complicated upsert logic, without the need to create columns to track history. Because everything is treated as an event, everything gets logged. This is very useful, especially when it comes to handling things like late arriving facts/events, which brings us to our next topic.

Late Arriving Facts/Events

Similar to how event publishing may lose connectivity or have a stream interrupted by some upstream issue, time-series fact tables may not get a feed for hours or a day or two, or during a data quality check users realize that an incorrect file was sent. This requires that data for events that have already happened be loaded into the table. Commonly, I see users try to do this join directly within the Eventstream. When data is arriving late, it is far more efficient to write the event to the storage engine first and then do the join, either at query time, or through update policies or materialized views.
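On the materialized-view option, one common pattern (sketched here with hypothetical table and column names) is a view that always exposes the latest reference record per key, using the ingestion timestamp as the version marker:

```kql
// Materialized view that always holds the latest snapshot row per user
.create materialized-view LatestUser on table User
{
    User
    | summarize arg_max(ingestion_time(), *) by user_id
}
```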

Fortunately, KQL gives us some very rich capabilities when it comes to dealing with this type of issue using some time series operators (it is a time series database after all 😊). Take the following example. We have a user reference table whose attributes change daily, and we then receive a feed of clickstream events that happened over the past few days. We want to identify the reference/dimension record that was active at the point in time each event occurred. This can be accomplished with the following query:

let Clickstream = datatable(user_id:string, event_time:datetime)
[
    "1", datetime(2025-09-07 14:23:00),
    "2", datetime(2025-09-07 15:45:00),
    "1", datetime(2025-09-08 10:00:00)
];
let User = datatable(user_id:string, browser:string, pc_type:string, record_date:datetime)
[
    "1", "Chrome", "Desktop", datetime(2025-09-07),
    "1", "Firefox", "Laptop", datetime(2025-09-08),
    "2", "Edge", "Tablet", datetime(2025-09-07)
];
// Join clickstream events to the latest user snapshot on or before the event_time
Clickstream
| join kind=inner (
    User
) on user_id
| where record_date <= event_time
| summarize arg_max(record_date, browser, pc_type) by user_id, event_time

Interestingly, in data warehousing scenarios these were always called late arriving facts, but the concept is the same. In SQL, doing this join is not trivial, and it does not perform well at scale.

Out of Order Event Handling

Another scenario you might run into when handling events is the concept of out of order event handling. In batch processing, this would be equivalent to processing multiple records with the same primary key in the same batch. Which one happened first, and how do you calculate deltas between them? In ETL architectures, this is a tricky challenge. In event driven systems, this could happen due to any number of reasons:

  • Network Latency
  • Distributed systems with asynchronous sources
  • Buffering delay
  • Clock skew

This is best demonstrated with an example. Imagine a clickstream system tracking user activity:

Tables: ClickstreamEvents, which contains a userId, EventTime, and the page they visited; and UserProfile, which contains a UserID, Browser, PCType, and the RecordDate

Situation:

  • A user visits a page on Sept 7, but the event arrives on Sept 9 due to network delay.
  • You want to join this event with the UserProfile as it was on Sept 7, not Sept 9.

The good news? You've already solved this. Whether we call it late arriving data or out of order events, the solution is the same: rather than re-pasting it here, use the query in the preceding section. At extremely high volumes, you may want to consider partitioning the data in Eventhouse on the event time rather than the ingestion time (the default), but this is not common.
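For that uncommon high-volume case, a datetime partitioning policy can re-partition extents on the event time column. This is a sketch only; the table name, column name, and range values are hypothetical and should be tuned carefully, since partitioning has real ingestion-cost implications:

```kql
// Partition extents by EventTime (uniform 1-day ranges) instead of ingestion time
.alter table ClickstreamEvents policy partitioning
@'{"PartitionKeys": [{"ColumnName": "EventTime", "Kind": "UniformRange",
   "Properties": {"Reference": "2025-01-01T00:00:00", "RangeSize": "1.00:00:00", "OverrideCreationTime": true}}]}'
```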

Note that this is a different kind of out of order event handling than the kind you need at the broker level due to interconnected exchange systems when leveraging RTI as a message broker. That's a topic for another time 🙂

Alternative Scenario

Alternatively, you may continue to have begin and end effective date stamps on your type 2 dimension. If the data is processed upstream for master data management/data quality purposes, perhaps you have already undergone the effort to create type 2 dimensions that track begin/end effective dates, and you are then loading or creating shortcuts of the data into Eventhouse to make this reference/dimension data available. Don't get rid of this hard work, leverage it! Kusto has some nice built in operators for this; we can easily join using the between operator in KQL:

let Events = datatable(EventId:int, EventTime:datetime, UserId:string)
[
    1, datetime(2023-01-01 10:00:00), "userA",
    2, datetime(2023-01-01 11:00:00), "userB",
    3, datetime(2023-01-01 12:00:00), "userA"
];
let Users = datatable(UserId:string, BeginEffectiveDatetime:datetime, EndEffectiveDatetime:datetime)
[
    "userA", datetime(2023-01-01 09:30:00), datetime(2023-01-01 10:30:00),
    "userA", datetime(2023-01-01 11:30:00), datetime(2023-01-01 12:30:00),
    "userB", datetime(2023-01-01 10:45:00), datetime(2023-01-01 11:15:00)
];
Events
| join kind=inner (
    Users
) on UserId
| where EventTime between (BeginEffectiveDatetime .. EndEffectiveDatetime)

Key Takeaways

The transition from traditional data warehousing concepts to event-driven architectures doesn't mean abandoning proven patterns—it means rethinking how to implement them more efficiently. Eventhouse's built-in ingestion time tracking, extent-based storage, and KQL's powerful time-series operators provide elegant solutions to challenges that were complex in traditional SQL environments.

Whether you're dealing with reference data contextualization, late-arriving events, or out-of-order processing, the key is leveraging the strengths of the platform rather than fighting against them. Embrace the event-driven mindset while maintaining the analytical rigor that dimensions and reference data provide.


If you're navigating AI applications of data, Fabric, or event-driven architectures and want a second opinion, feel free to reach out!