Hybrid or Not – A case study in Enterprise Data Sharing

Most enterprises today struggle to get value out of data that is distributed across their organizations. Let us look at an example to understand why, and then at a potential solution.

In a large insurance company, a group of data scientists belonging to a central organization is tasked with a project. They need to determine whether replacing the fleet of company-issued cars with plug-in hybrids will save the company money. For an accurate assessment, they need data from an internal application that logs the trip data of these cars, because plug-in hybrids do very well when used mostly for short trips.

Now, this data is sensitive. Because these cars are authorized for reasonable personal use, the trip data also covers employees' personal trips, including the exact GPS coordinates of each trip. Releasing this data as-is to the data scientists would be a violation of employee privacy.

What follows next is usually something like this:

  1. The data scientists file a help-desk ticket with data engineering, requesting the desired data. Because the data engineering team is overloaded, it is a few days before they get to the ticket.
  2. Since the data engineers don’t really understand the semantics of the data being requested, they need to meet with the owner of the data to understand what they’re looking at. Based on this conversation and the contents of the help-desk ticket, they author an ETL pipeline with a transform that adds a ‘fuzz’ to the GPS coordinates in order to hide the specific address while still keeping the location accurate enough for the task at hand.
  3. Finally, weeks after filing the request, the data is available to the data scientists. But it’s not just the data scientists working on the project who have access to it. Lacking a convenient way to limit access to just the project team, the data engineers have made the data visible to everyone in the data science organization. People who have nothing to do with the project in question have access to the data, potentially in violation of regulations like the GDPR.
  4. By now the project is running behind schedule, but after a few late nights at the office, the team is able to determine that switching the fleet to hybrids would indeed save the company money. Mission accomplished.
  5. They move on to another project. The data is forgotten and the data pipeline keeps running each night, moving around globs of data that no one needs and making it visible to people who should not have access to it.
  6. Rinse and repeat for a few projects. The data engineering team continues to run from pillar to post servicing requests for data that they don’t really understand or own. There are now hundreds of data pipelines running each night, racking up large infrastructure costs. The data lake contains terabytes of data with loose access controls, and no one knows which data is being used by whom and for what.
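
To make the ‘fuzz’ transform from step 2 concrete, here is a minimal Python sketch of what such a step might look like. The function name fuzz_coordinate and the default offset of 0.005 degrees (roughly half a kilometre of latitude) are assumptions made for illustration; they do not come from any real pipeline.

```python
import random

def fuzz_coordinate(lat, lon, max_offset_deg=0.005, rng=None):
    """Shift a GPS point by a random offset in each axis.

    Hypothetical helper: max_offset_deg is an assumed tuning knob.
    0.005 degrees of latitude is roughly 550 m; the same offset in
    longitude covers less ground the further you are from the equator.
    """
    rng = rng or random.Random()
    return (
        lat + rng.uniform(-max_offset_deg, max_offset_deg),
        lon + rng.uniform(-max_offset_deg, max_offset_deg),
    )

# Example: hide the exact start point of a trip.
fuzzed_lat, fuzzed_lon = fuzz_coordinate(47.620506, -122.349277)
```

Note that random noise alone can be averaged away if the same address appears in many trips, so a real pipeline would need to weigh how much fuzz is enough.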

Clearly, not a happy situation. It impairs data agility, is heavy on data engineering and infrastructure costs, and creates risk.
But it does not have to be like this. Let’s look at how things work when the company uses Elten for inter-organizational data sharing:

  • The data scientists request the data directly from the data owner.
  • The data owner logs into Elten. With a few clicks, she adds a rounding function to the GPS coordinates column to hide the exact addresses. With a few more clicks, she grants the project team access to the data.
  • A few minutes later, the project team (and only the project team) gets the data and can start analyzing it, while Elten monitors and audits every access to it.
  • At the end of the project, Elten automatically revokes access and cleans up the data.
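
As a rough illustration of the kind of rounding function the data owner applies, here is a minimal Python sketch. It is not Elten's actual implementation or API; the name round_coordinate and the default of two decimal places are assumptions chosen for the example.

```python
def round_coordinate(lat, lon, decimals=2):
    """Snap a GPS point to a coarse grid by rounding both values.

    Hypothetical helper: two decimal places corresponds to a grid of
    roughly 1.1 km in latitude, which hides a street address while
    keeping trip distances approximately intact.
    """
    return (round(lat, decimals), round(lon, decimals))

# Two nearby addresses collapse onto the same grid point.
home = round_coordinate(47.620506, -122.349277)
neighbor = round_coordinate(47.621900, -122.351000)
print(home, neighbor)
```

Unlike random fuzz, rounding is deterministic: the same address always maps to the same grid point, so repeated trips do not leak extra detail.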

The result: a reduced data engineering workload, lower infrastructure costs, a significantly smaller risk surface AND faster time-to-insight.

Find out more at www.elten.io
