Data ingestion and storage in the lakehouse

24 Jul 2024

Data is valuable, and how an organisation ingests and organises it is crucial to gaining comprehensive insights. Microsoft Fabric’s OneLake allows businesses to benefit from this data in a unified, simple, and secure Lakehouse.

As we’ve discussed in the essentials of an AI Estate, data architecture requires a robust and comprehensive data ingestion and storage scheme, known as a data lakehouse. This lakehouse must be open, unified, flexible, easily manipulable without loss, and above all, secure.

In the information economy, data is almost as valuable as currency. Though Facebook does not charge a fee to use its platform, its core revenue comes from using its data to target advertisements precisely on behalf of those willing to pay for it: demographic data, location data, device data, preference data, third-party data, and so on.

One could argue that data is even more valuable than the capital expended to capture it, as breaches or leaks of data can land businesses with government fines, legal action, and reputational damage beyond repair.

According to the Australian Productivity Commission:

“For the past 30 years, data has played an increasingly central role in both the economy and society. Stemming from large scale digitisation, connectivity, smart devices and computing infrastructure, data has become a key input into production and innovation. And while data use can generate significant value across the economy, doing so inevitably raises questions about who owns, controls and should benefit from data.”

Business leaders, managers, and users depend on data to drive growth, improve decisions, and operate more efficiently with actionable insights.

Taking the capabilities required of an AI Estate as the ideal architecture for an AI-driven, data-centric business, we must examine how data is ingested, or captured, by the system and how it is stored, so that the goals above translate into positive business outcomes.

OneLake and the Lakehouse

According to Microsoft, “OneLake is essentially the OneDrive for data within the Fabric ecosystem,” the metaphor being that OneDrive is personalised, secure storage for every user within the system, requiring no additional infrastructure management (e.g. setting up virtual drives).

OneLake is designed to connect seamlessly with the cloud architecture that supplies the data. It is open in that every persona working with the data, such as developers, engineers, and analysts, can access it directly in OneLake. It also uses the open-source Delta format, which is optimised for data engineering workflows and supports efficient storage, versioning, schema enforcement, ACID transactions, and streaming.
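
To make versioning, schema enforcement, and atomic writes a little more concrete, here is a minimal sketch in plain Python. It is a toy illustration of the ideas only, not the Delta format or any Fabric API; the class ToyVersionedTable, the SchemaError exception, and the sample columns are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field


class SchemaError(ValueError):
    """Raised when a row does not match the table schema."""


@dataclass
class ToyVersionedTable:
    """Hypothetical append-only table: every commit yields a new readable version."""

    schema: dict  # column name -> expected Python type
    _commits: list = field(default_factory=list)  # one list of rows per commit

    def append(self, rows):
        """Validate every row first, then commit all of them or none."""
        rows = list(rows)
        for row in rows:
            if set(row) != set(self.schema):
                raise SchemaError(f"columns {sorted(row)} do not match the schema")
            for column, expected in self.schema.items():
                if not isinstance(row[column], expected):
                    raise SchemaError(f"{column!r} must be {expected.__name__}")
        # Nothing is stored until the whole batch has passed validation.
        self._commits.append([dict(row) for row in rows])
        return len(self._commits)  # version number created by this commit

    def read(self, version=None):
        """Return the rows as of a version (latest by default); version 0 is empty."""
        upto = len(self._commits) if version is None else version
        return [row for commit in self._commits[:upto] for row in commit]


sales = ToyVersionedTable(schema={"product": str, "units": int})
v1 = sales.append([{"product": "X", "units": 3}])
v2 = sales.append([{"product": "X", "units": 5}, {"product": "Y", "units": 1}])

# A batch containing one bad row is rejected as a whole (schema enforcement).
try:
    sales.append([{"product": "Z", "units": 1}, {"product": "X", "units": "seven"}])
    rejected = False
except SchemaError:
    rejected = True

latest = sales.read()               # rows from both successful commits
as_of_v1 = sales.read(version=1)    # the table as it looked after the first commit
```

The rejected batch leaves no trace in the table, and the earlier version stays readable after later writes. Those are the properties the paragraph above attributes to the storage format, shown here at toy scale.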

OneLake is unified and flexible as it uses symbolic links and shortcuts to simplify navigation without being “destructive” (that is, without one user erasing or changing the data for others).

Users may set up individual lakehouses, with ingestion taking place from shortcuts. The ingestion can be controlled via tables and queries. Analysis and reporting, which may have required specialist coding knowledge in the past, are made easier via Microsoft’s Copilot AI, which uses natural language queries to achieve outcomes (e.g., “How many units of X did we sell over the last three months to under-25s?”). Locating files is also simplified with Copilot AI (e.g., “call up all data relating to January 2024”).
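
As an illustration of what such a natural-language question stands in for, the sketch below answers the units-sold example with an explicit query. It assumes a hypothetical sales table with made-up columns and sample rows, and it uses Python's built-in sqlite3 module purely as a stand-in for a lakehouse table. It does not show how Copilot works internally or how Fabric is queried.

```python
import sqlite3

# In-memory database standing in for a lakehouse table (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (product TEXT, units INTEGER, customer_age INTEGER, sold_on TEXT)"
)

# Hypothetical sample rows; dates are ISO strings so they compare correctly as text.
sample_rows = [
    ("X", 10, 22, "2024-05-14"),
    ("X", 4, 31, "2024-06-02"),   # customer is not under 25
    ("X", 7, 19, "2024-07-01"),
    ("Y", 5, 24, "2024-06-20"),   # different product
    ("X", 3, 23, "2024-01-09"),   # outside the three-month window
]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", sample_rows)


def units_sold_to_under_25s(connection, product, as_of):
    """Total units of `product` sold to under-25s in the three months up to `as_of`."""
    query = """
        SELECT COALESCE(SUM(units), 0)
        FROM sales
        WHERE product = ?
          AND customer_age < 25
          AND sold_on > date(?, '-3 months')
          AND sold_on <= ?
    """
    (total,) = connection.execute(query, (product, as_of, as_of)).fetchone()
    return total


total = units_sold_to_under_25s(conn, "X", "2024-07-24")
print(total)  # prints 17 for the sample rows above
```

The value of a natural-language interface is that a business user can ask the question directly, without writing or maintaining the filter on product, age, and date range by hand.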

As for data security, this is handled by Microsoft Purview, a key component of Fabric, in addition to the usual security controls such as identity and access management and role-based access controls. Purview enables system-wide information security governance through domains and data loss prevention, powered by AI labels and classification systems.

Fabric also has data science and engineering tools, known as Synapse, built into the Lakehouse.

How we can help

Irada helps growth-minded organisations apply practical, aligned, and robust AI solutions in business operations.

References, Resources, Readings

Productivity Commission 2024, Making the most of the AI opportunity: AI raises the stakes for data policy, Research paper, no. 3, Canberra, p. 2.

Links to external websites were correct at the time of publishing. Articles may be behind a paywall. Irada is not responsible for the content of external websites.

The information in this article is general in nature. Your circumstances may vary.

This work is licensed under CC BY-NC-SA 4.0 
