In this article, we will look into data lake implementation approaches to support your supply chain performance and operations.
We will cover 3 chapters:
- What is a data lake and its importance for Strategic Analysis and Data Science?
- What are the main pitfalls when implementing a data lake?
- Understanding a new approach to data lake implementation: the "process centric" approach
What is a data lake and its importance for Strategic Analysis and Data Science?
IBM defines a data lake as a centralized repository that allows a supply chain organization to store all its structured and unstructured data in a single source of truth.
Within a data lake, there are two ways to manage data: raw or organized.
The preferred approach will depend on your need:
- Strategic analysis and data science,
- or running your business day to day.
Data lakes enable organizations to leverage big data technologies such as Hadoop and Spark to perform Artificial Intelligence (AI) and machine learning for their supply chain and operations.
If you'd like to leverage big data for machine learning, then the data can be stored in its raw, native format, without prior organization or categorization.
This provides greater flexibility and faster time to insight. For example, it can be useful for identifying potential breakdowns before they occur with AI-based predictive maintenance.
The alternative is the "organized" approach: connecting similar and relevant data together. The data needs to be cleaned, structured and contextualized.
This requires business input (read: business folks' expertise) to match and contextualize information so it can provide insights for supply chain planners, managers and leaders.
This structured approach is necessary for everything else that is not based on big data.
For example, a daily task might be to check the top ten stock deviations in your warehouse to prevent out-of-stocks. The task may require comparing data from multiple sources: the stock level in the ERP, the forecast in the planning system (APS) and the order book in the CRM.
These sources of information are well known, and all the data can be linked to a unique identifier: the SKU number. A data lake is therefore useful to connect all of the information from the different tools automatically.
Indeed, the daily stock analysis performed by the supply chain planner or the procurement manager is an easy but daunting task, as it typically requires extracting three flat files from these systems and then running some Excel wizardry to get the full picture. Only at that point does the actual work start: understanding the situation and acting on it by engaging the right people fast enough to address deviations proactively.
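To make the idea concrete, here is a minimal sketch of that daily stock-deviation check in plain Python, with made-up SKUs and figures. The three dictionaries stand in for the flat-file extracts from the ERP, APS and CRM, all keyed on the SKU number; the shortfall formula is an illustrative assumption, not a standard calculation.

```python
# Hypothetical extracts, keyed by SKU (the unique identifier across systems).
erp_stock = {"SKU-001": 40, "SKU-002": 5, "SKU-003": 120}      # on-hand stock (ERP)
aps_forecast = {"SKU-001": 60, "SKU-002": 30, "SKU-003": 100}  # forecast demand (APS)
crm_orders = {"SKU-001": 30, "SKU-002": 20, "SKU-003": 10}     # open order book (CRM)

def stock_deviations(stock, forecast, orders, top_n=10):
    """Rank SKUs by projected shortfall: stock minus forecast minus open orders."""
    rows = []
    for sku in stock:
        projected = stock[sku] - forecast.get(sku, 0) - orders.get(sku, 0)
        rows.append({"sku": sku, "projected_stock": projected})
    # Most negative projected stock first: these are the deviations to act on.
    rows.sort(key=lambda r: r["projected_stock"])
    return rows[:top_n]

for row in stock_deviations(erp_stock, aps_forecast, crm_orders, top_n=3):
    print(row["sku"], row["projected_stock"])
```

The join itself is trivial once everything shares a SKU key; the daunting part the planner faces today is the manual extraction and reconciliation, which is exactly what a data lake automates.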
A data lake would serve up such information on a silver platter every day. This is what business people want from a data lake, and they don't want to wait several years for it.
This issue is often linked to the business case built to justify the data lake in the first place: businesses often lean on AI benefits to justify it.
Yet a black-box algorithm will have virtually no impact on the day-to-day lives of most supply chain and operations employees and managers. Things are changing fast, though.
What are the main pitfalls when implementing a data lake?
- Data lake initiatives often take several years, especially in larger organizations.
- Unfortunately, larger organizations are also where multiple tools and systems coexist.
- Thus, this is where an integrated supply chain data lake is needed most.
- A large part of the slowness comes from data integration and management, as well as technical skill challenges.
- Supply chain and operations folks thus actively look for alternatives, undermining the data lake effort (often heard in open spaces from business folks: "I just need a simple report and I can't wait 18 months for it!").
- Typically, during an implementation, data engineers tweak the ingested data to make it usable by business folks.
- Rarely is the data fixed at the root (i.e., in the source systems).
- And when it is, the data lake implementation is paused until all the other systems are under better data governance.
- A data lake initiative is good at surfacing discrepancies between systems and offers a real opportunity to clean up data.
- Unfortunately, data quality issues will keep appearing in the future, with the risk that users distrust the whole data set.
- Reports do get delivered from data lake initiatives.
- Yet all the implementation effort has no impact on the day to day until someone manually looks at the dashboards and takes action.
- These initiatives deliver real-time analytics but never get to real-time execution.
- Another pitfall is insufficient input from the business to the IT team driving the initiative.
- This leads to reports that are technically correct but do not present data in a way that drives insights.
Real Time Hype
- In numerous use cases, real-time data is not necessary.
- It is a must-have if you're doing AI to predict the next equipment failure, but not for many other use cases.
- Many teams are not clear on this and therefore invest more time and resources than needed.
Not Future Proof
- Reporting needs will change often as the business evolves.
- The IT team has multiple priorities and will not be able to update the data lake interfaces quickly enough for the business.
- This leads operational workers to create new reports on their own, defeating the intent.
Understanding a new approach to end-to-end data lake implementation: the "process centric" approach
Data lakes are must-haves. You may call the data lake a "single source of truth", or even a "single repository database", in your organization. The point is: to run an organization with multiple stakeholders and functions efficiently, data needs to be easily retrievable for the day to day. Data lakes are a great foundation for that.
Data lake initiatives are often slow to implement and do not capture all the business needs. They are often tech driven as opposed to business driven. In a world where business folks need new tools fast, this is too slow.
This creates frustration on the business side as well as on the IT side. Business folks often underestimate the challenging data engineering required to get the data lake shipped. Yet they may only need 10% of that effort to fulfill their basic reporting needs. Big frustration indeed.
This is where a new approach comes in.
This new approach to data lake initiatives can shorten implementation time; we call it the "process centric" approach.
It does not mean you need formal processes per se. It means starting from the day-to-day business needs.
For example, let's assume I'm a supply chain planner, a buyer, a production planner or even a customer representative.
I need to perform my daily activities efficiently, such as:
- Review top 10 stock deviations
- Process invoices
- Check forecast deviations
- Re-plan production due to supplier delays and customer order changes
- ... any other work activities performed by operational teams to keep the business going!
For each of the above, the data required to identify, decide and act on these activities is relatively minor and simple to gather.
With a process-driven approach, you simply identify the triggers (e.g. the top 10 products to look at) and, for these, display the data connected to them. If you want to check the lowest stock levels daily, you may want the order book and what is planned in production, alongside the level of raw material. This simple yet powerful set of information is more than enough for a trained eye to understand the situation, take a decision and act on it. You don't need to make it more complex than this.
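The trigger-plus-connected-data idea can be sketched in a few lines of Python. Everything here is hypothetical (SKUs, field names, the "lowest three stock levels" trigger); the point is only the shape: a trigger selects what to look at, and a small context record is assembled per item.

```python
# Made-up stock levels per SKU.
stock_levels = {"SKU-A": 12, "SKU-B": 3, "SKU-C": 85, "SKU-D": 7}

# Per-SKU connected data, as it might come from the order book, the
# production plan and the raw-material inventory (all hypothetical).
context = {
    "SKU-A": {"open_orders": 20, "planned_production": 30, "raw_material_cover_days": 5},
    "SKU-B": {"open_orders": 15, "planned_production": 0,  "raw_material_cover_days": 2},
    "SKU-C": {"open_orders": 10, "planned_production": 0,  "raw_material_cover_days": 30},
    "SKU-D": {"open_orders": 25, "planned_production": 40, "raw_material_cover_days": 1},
}

def trigger_lowest_stock(levels, top_n=3):
    """Trigger: the SKUs with the lowest stock, i.e. where attention goes first."""
    return sorted(levels, key=levels.get)[:top_n]

def build_screen(levels, ctx, top_n=3):
    """One 'screen' line per triggered SKU: stock plus the connected data around it."""
    return [{"sku": sku, "stock": levels[sku], **ctx[sku]}
            for sku in trigger_lowest_stock(levels, top_n)]

for line in build_screen(stock_levels, context):
    print(line)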
What does this mean? You do not need to connect all data across every system into a single, entirely centralized, always real-time source of truth. With a process-centric approach, you simply show on one screen all the data connected to a specific operational need, letting supply chain folks take great, fast decisions.
The way this is achieved is to think in terms of a workflow that gets initiated based on rules, events or frequency. Once triggered, the workflow requests current data from the different systems one by one, fetching just the information required in the context of that activity.
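Such a workflow can be sketched as follows, again under assumed names. Each "system" is a plain callable standing in for a real API call to the ERP, APS or CRM; the trigger is a rule, event or schedule check, and the systems are queried sequentially only when it fires.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Workflow:
    name: str
    trigger: Callable[[], bool]                # rule, event or schedule check
    steps: list = field(default_factory=list)  # (system_name, fetch_fn) pairs

    def run(self):
        """Fire only when the trigger condition holds; query systems in order."""
        if not self.trigger():
            return None
        payload = {}
        for system_name, fetch in self.steps:
            payload[system_name] = fetch()  # just-in-time, per-system pull
        return payload

# Hypothetical wiring: in practice the lambdas would hit the ERP/APS/CRM APIs.
wf = Workflow(
    name="daily-stock-check",
    trigger=lambda: True,  # e.g. "it is 07:00" or "stock dropped below safety"
    steps=[
        ("erp", lambda: {"SKU-001": 40}),
        ("aps", lambda: {"SKU-001": 60}),
        ("crm", lambda: {"SKU-001": 25}),
    ],
)
result = wf.run()
```

Because each run pulls only what the activity needs, the systems stay the owners of their data and the workflow never requires a fully centralized, always-on copy.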
Process- and trigger-driven workflows for specific contexts create a continuous feedback loop: the data is used in context day to day, and there is a clear incentive to capture information and clean up data because doing so makes the process easier.
A "process centric" and trigger-driven approach is the best way to prioritize, build or complement your data lake.
It supports moving beyond real-time analytics to real-time execution.
It can help you get started in weeks, not months or years.
Note: A traditional data lake implementation is a must-have for strategic analysis and data science. There is a caveat, though: the data doesn't necessarily need to be real-time. So an arbitrage may even be needed on whether to build a data lake for strategic analysis at all. If strategic analyses are only performed every 12 months, it might be best to have a data engineer prepare the data sets manually on a regular basis rather than maintain a real-time data lake that is used only once in a while, given the maintenance costs.