Category Archives: Analytics, AI and Data

Data Science for Grown Ups – How to Get Machine Learning out of the Lab to Scale It Across the Enterprise

The Lab Trap

Have you ever wondered why so many Internet companies use algorithms across their core business processes to drive automation and constantly optimise their business, while your algorithms never really left the lab? You are not alone. Many large organisations have invested heavily in data science and big data over the past few years and often struggle to scale their successful machine learning projects beyond a small pilot scope. Something went wrong when we tried to copy the digital players: algorithms stay in the lab and are not put into the heart of the enterprise. This article sheds some light on how traditional big corporates can take a leap towards the digital players, and it builds on my previous article “10 Rules for Data Transformation in Inherently Traditional Industries”.

Dr. Alexander Borek, Global Head of Data & Analytics, Volkswagen Financial Services;
Alexander will be speaking at the Enterprise Data & Business Intelligence and Analytics Conference Europe 19-22 November 2018 in London on the subject, Data Science for Grown Ups: How to Get Machine Learning out of the Lab to Scale it Across the Enterprise

How we all got there

A few years ago, the business world realised that machine learning, big data and AI can generate new value out of the large amounts of data produced by the digitalisation of the business. New tools, new types of databases and the rise of cloud computing made it possible to combine and process high volumes and diverse formats of data, bringing new flexibility to working with data. New ways of working between business and IT, aimed at delivering rapid business value, were introduced in the tech startup world and copied by more established businesses, bringing new agility. Machine learning and AI methods entered business life, with the effect that processes can be increasingly automated.

We hired plenty of data scientists and let them do magic. Inside the lab, the use cases worked: we could solve complex business problems very quickly. But when we transported them to the real world, they suddenly broke down and collapsed. Outside the lab we often find a hostile environment which creates a number of challenges and threats for our precious little algorithms:

• Different toolsets and architectures
• Cloud cannot be used
• Legacy IT Systems
• Slow and complex purchasing and approval processes
• Risk, security, data protection, regulations and compliance issues
• Deployment procedures unfit
• Traditional IT operations unfit
• Data quality issues
• Inconsistent data models
• Strong cultural resistance
• Data scientists are inexperienced and don’t know “how the company is run”

This is because we never fixed these problems; we just tried to create a new free space where innovation can grow and foster within a safe environment, the Innovation Lab, where everything was sort of allowed. What works inside the capsule of an innovation lab does not necessarily work in outer space, i.e. the rest of the organisation. Many organisations simply ignored the fact that anything that comes out of the lab will be dead within seconds as it leaves the “free as a bird” environment and enters the “caged bird” environment that we are used to in large corporates. Many IT organisations feel threatened by the labs and have little motivation to help them bring successful prototypes into production. Regulations such as GDPR are seen as helpful allies to reject the work of the labs as unrealistic and non-compliant.

The Data Factory can build the bridge between the two worlds

In one way or another, we need something to bridge the two environments so that data analytics & AI products can be delivered successfully, and in my opinion this bridge is the Data Factory. The Data Factory needs to replace or extend the Data Lab to ensure that new innovative projects can be executed and then scaled into the rest of the organisation. There are at least five key components of such a Data Factory, possibly even more.

(1) The Data Platform ensures that technologies are available outside the capsule. It provides a common set of state-of-the-art data analytics & AI tools for the sandbox, development and production stages, which means that an idea for an algorithm can be explored and later operated in the same technological environment. Standardised programming languages (e.g. Python), tools and packages are used throughout all phases of the project. Security, data protection and compliance are ensured on the platform, so the algorithm developer does not need to worry so much when the algorithm leaves the lab. It also provides central storage for productive data as part of a joint data lake and data catalog, and standardised accesses and interfaces to legacy IT systems.

(2) Processes outside the capsule need to be updated to ensure they can handle data analytics & AI projects. They include at least the PLEASE processes: Purchasing, Legal, Evaluation, Auditing, Security and Ethics.

(3) Data Engineers are usually more important than data scientists after prototyping, but many companies have hired a lot of data scientists and not enough data engineers. They ignored that once the analytical model is designed, it is mostly about software engineering! Data engineers focus on software engineering rather than modelling. Software and architecture skills are key for developing ETL processes, integrating and cleansing data, writing APIs and database connections, creating CI/CD pipelines and DevOps, testing and deploying data products, and adding new components to the platform.
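To make the contrast with modelling concrete, here is a minimal sketch of the kind of ETL plumbing a data engineer builds around a finished model. All table, field and record names below are hypothetical, and the in-memory list and SQLite database stand in for real source and target systems:

```python
import sqlite3

def extract(rows):
    """Pull raw records from a source system (here: an in-memory list
    standing in for a legacy system export)."""
    return list(rows)

def transform(records):
    """Cleanse and normalise: drop rows with missing IDs, trim whitespace,
    cast amounts to float."""
    cleaned = []
    for rec in records:
        if not rec.get("customer_id", "").strip():
            continue  # data quality rule: customer ID is mandatory
        cleaned.append({
            "customer_id": rec["customer_id"].strip(),
            "amount": float(rec["amount"]),
        })
    return cleaned

def load(records, conn):
    """Write the cleansed records into the target store and report the count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments (customer_id TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO payments VALUES (:customer_id, :amount)", records
    )
    return conn.execute("SELECT COUNT(*) FROM payments").fetchone()[0]

raw = [
    {"customer_id": " C-1 ", "amount": "19.99"},
    {"customer_id": "", "amount": "5.00"},  # rejected: missing ID
    {"customer_id": "C-2", "amount": "7.50"},
]
conn = sqlite3.connect(":memory:")
loaded = load(transform(extract(raw)), conn)
print(loaded)  # 2 rows survive cleansing
```

None of this is statistical modelling, yet without it the model never reaches production, which is exactly the point of the paragraph above.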

(4) Data Ops is needed to support the operation and maintenance of finished data analytics & AI products. Data Ops contains tasks that are perceived as unattractive by data scientists, e.g.:

• Rules for deployment
• Helpdesk and Ticketing
• ITIL processes
• Managing SLAs
• Logging
• Archiving
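As an illustration of the logging and SLA side of this list, here is a minimal, hypothetical sketch of a check a Data Ops team might run against the response times of a deployed scoring service. The 500 ms target and the latency figures are assumptions for the example:

```python
import logging
import statistics

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("dataops")

SLA_P95_MS = 500  # hypothetical service-level target for scoring latency

def check_latency_sla(latencies_ms):
    """Compare the observed 95th-percentile latency against the SLA target
    and log a breach so the ticketing process can pick it up."""
    p95 = statistics.quantiles(latencies_ms, n=20)[18]  # 95th percentile
    breached = p95 > SLA_P95_MS
    if breached:
        log.warning("SLA breach: p95=%.0f ms > %d ms", p95, SLA_P95_MS)
    else:
        log.info("SLA ok: p95=%.0f ms", p95)
    return p95, breached

# one outlier at 700 ms pushes the 95th percentile over the target
p95, breached = check_latency_sla([120, 180, 200, 220, 250, 300, 310,
                                   320, 350, 400, 410, 420, 430, 440,
                                   450, 460, 470, 480, 490, 700])
```

Unglamorous, perhaps, but it is this kind of routine monitoring that keeps a data product alive after the data scientists have moved on.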

Furthermore, IT Operations and Data Scientists often speak different languages. Data Scientists feel misunderstood because things are complicated and slow. IT Operations are irritated by the data scientists’ perceived ignorance of corporate processes.

(5) And here the Data Product Manager comes into the game! The Data Product Manager is 100% involved from start to end of the data analytics & AI product. The Data Product Manager is a true multi-talent in data science and management. He or she holds end-to-end responsibility for the Data Analytics and AI product, which includes:

• Ideation and data product definition
• Product owner in scrum approach
• Managing all stakeholder relations
• Accountable for deployment
• Ensuring SLAs during operation
• First contact for change requests
• Change management

The Data Product Manager understands the Data Scientists and Data Engineers, but also speaks the language of the more traditional business functions. He or she is the key person to bring change within your organisation and to create the cultural bridge between the two universes within your corporation.

Obviously there are further important elements of every successful Data Factory that I did not mention here. Nevertheless, the Data Factory concept presented in this article should make you rethink how you organise Data Science, AI and Business Intelligence across your enterprise. Maybe a key learning is that Data Science, AI and Business Intelligence are closer than you think and should come as close together as possible!

10 Rules for Data Transformation in Inherently Traditional Industries

I gave a keynote at the Data Insight Leader Conference in Barcelona this week, speaking about the 10 rules of data transformation in inherently traditional industries, based on the experience I have gained so far. Interestingly, there was agreement among the Data Executives in the room that we need to keep doing showcases while transforming the company, and that it all needs to be wrapped up in a powerful narrative.

Here are the ten rules that I presented.

Rule 1: Accept that digital transformation success is a myth

As discussed in my last blog post, there is an inherent dilemma in digital transformation. Simply put, you either appear successful at digital transformation by focusing on digital showcases (the happy honeymoon) or you attempt the real transformation of how the company operates (the endless road). When you are doing digital transformation, never just say you are doing this and that because of this and that. Always tell a story of where we are today, what the steps in between are and what the end game looks like.

Rule 2: Demonstrate how basic beliefs in your industry are turned upside down due to digital disruption

A major mistake that many Digital Transformation Executives make at the beginning is to assume that others understand and share the basic beliefs about the success factors and changes in their industry. But they don’t. Others in your company, including top management, have been running the company based on the same unchanged beliefs for decades. Such beliefs consist of things that have been true for a while: the standardization of processes to reduce cost, hardware product centricity to ensure product quality and attractiveness, the number of sold products as the key KPI, and the number of physical stores as a reflection of market power. Digital disruption suddenly turns this world upside down. Software engineering may become equally important to hardware engineering and manufacturing. Data & Analytics become a central part of product quality and customer experience. It’s not about physical stores but digital touch points. At the very beginning, the CDO needs to set the scene, explain the implications and then ask directly for the changes needed to match those implications. It is important to tell your senior management explicitly how basic beliefs are turned upside down by the forces of digital disruption and which implications this has for corporate strategy, organizational structures, KPIs and incentive plans that need to be introduced or adapted. Complementary implementation projects that demonstrate value and roadblocks can help, as they can be discussed as tangible examples.

Rule 3: Communicate the simple equation: Digital = Data + X

This is the most important formula for Data Executives. Communicate it at all times. Anything digital is a result of data (collecting, using, combining, analyzing data) plus something else on top. It means that there is no digital transformation without a data transformation. Any Chief Digital Officer who says “we will deal with Data & Analytics later, since we have other, more important priorities for digital transformation at the moment” misses that anything else digital he wants to do requires Data & Analytics. Unfortunately, in very product-driven companies this happens very often. Communicating the magic formula constantly and explaining it with tangible examples reminds everyone around us that data is a key ingredient of any form of digitalization effort and any digital product. The fact that we always need something more than data & analytics puts any data executive at a strategic disadvantage. Simply put: if others don’t do their job, you are screwed. So you had better choose projects where you can rely on the X! The best algorithm for determining the optimal pricing of goods does not add much benefit if its results are not used inside an e-commerce portal to improve pricing. This requires changes in the e-commerce portal itself and the processes around it.

Rule 4: Train your existing workforce in data analytics – everyone can learn it

It is naive to think that you can hire all your data scientists from other companies. You can hire a few experts; the rest you need to train. And it is not only the data scientists you need to train. You also need to train the data and digital project managers and program managers, the people who steer them, and their top management. They all need a better understanding of what it takes to build a great data product. A lot of culture change comes through this type of education. It is therefore absolutely essential and not a side activity.

Rule 5: Ask the board to delegate decision making power to cross functional data analytics roles and bodies

The board cannot decide on all data aspects. They need to decide on the governance framework and strategic decisions; the rest needs to be delegated. Here the problems start. These decisions are typically delegated to individual departments such as sales and production: customer data decisions to the customer departments, production data decisions to the production departments. This is wrong. Decisions on data should ultimately be taken by cross-functional committees and cross-functional data roles, not by individual departments. A data-producing department might not see a need to provide data in a readable format to a data-consuming department. Creating the cross-functional committees and roles for data and analytics is one of the first things you should do as a Data Executive.

Rule 6: Free your data

Everyone in the media talks about the large volumes of data that are available to be analyzed for all sorts of purposes. The brutal reality in traditional companies is that we often find no lakes at all, but a vast data desert instead. Data is locked away in hundreds of different legacy systems that cannot be used due to their instability and the risk of impacting running operational applications. This is something that top management does not perceive, since they get all the analysis they ask for (within a day). Even worse, company politics prevent data from being shared. And GDPR and other compliance challenges make this even more difficult. Any project suffers under it. Any. Show your board, with concrete examples, the need to create an electronic workflow that regulates data sharing and access, involving all control functions such as legal, security and data protection. Document how long it takes to get the data and what the problems are with GDPR and other approvals. Create a virtual or physical data lake with an integrated access layer to provide the data to the data scientists and data users. The workflow has prechecked criteria and categories: instead of the control functions making individual decisions all the time, they preapprove certain data usage for certain purposes.
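The preapproval idea can be sketched in a few lines. The data categories, purposes and decision logic below are purely illustrative assumptions, not a real workflow engine:

```python
# Hypothetical pre-approval matrix: (data category, purpose) pairs that the
# control functions (legal, security, data protection) have cleared up front.
PREAPPROVED = {
    ("vehicle_telemetry_anonymised", "product_analytics"),
    ("contract_master_data", "credit_risk_reporting"),
}

def request_access(data_category, purpose):
    """Grant pre-approved combinations immediately; everything else is
    routed to a manual review by the control functions."""
    if (data_category, purpose) in PREAPPROVED:
        return "granted"
    return "manual_review"

print(request_access("vehicle_telemetry_anonymised", "product_analytics"))  # granted
print(request_access("customer_personal_data", "marketing"))  # manual_review
```

The point of the sketch: the fast path handles the routine cases, so the control functions only spend time on the genuinely sensitive requests.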

Rule 7: Share knowledge on data

Our romanticized view of what data scientists do all day is that they create complex statistical models and algorithms and apply deep learning and other sophisticated machine learning methods. Wrong. Data scientists in traditional corporations search for the right data all day long. Once they find it, they need to find out what the data means by trying to set up meetings with business and IT departments that do not see helping the data scientist as part of their regular job; it is more like a hobby or welfare activity. Just let everyone who finds something out document it in the business glossary. It’s a give-and-take model: you document, and next time you can find something that others have entered. People in the business departments also have an incentive to feed their data knowledge into the system, since they then have less work explaining the data to each new data project. In the end, all we need is some way of tracking these activities as part of each project and some good peer pressure. This usually reduces the workload of data scientists and data projects by around 30-40%. The world can be so simple.
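A business glossary in this give-and-take spirit does not need to be complicated. The following sketch, with hypothetical terms, systems and caveats, shows the minimal structure: one operation to document a finding, one to look it up:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class GlossaryEntry:
    """One documented finding about a data field; all names are illustrative."""
    term: str
    definition: str
    source_system: str
    caveats: List[str] = field(default_factory=list)

glossary = {}

def document(entry):
    glossary[entry.term] = entry  # give ...

def look_up(term):
    return glossary.get(term)     # ... and take

document(GlossaryEntry(
    term="contract_end_date",
    definition="Planned end of the contract, not the actual return date",
    source_system="LEGACY_CRM",
    caveats=["NULL for contracts migrated before 2010"],
))
```

The technology is trivial on purpose; what matters is the habit of writing findings down where the next project can retrieve them.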

Rule 8: Automate data preparation

Another 20-40% of workload reduction for data scientists and data projects can be realized if we automate one of their most tiring tasks: data preparation. Let’s say you run 50 data projects a year in your company, and let’s assume that 20% of the data processing work could be automated once it has been solved for the first time. If automating these data preparation tasks takes only half of the time and resources of doing them manually, you have a pretty good business case. After a while, you will realize the full 20% savings.
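One way to make this back-of-the-envelope business case explicit, using only the illustrative figures from the paragraph above and measuring effort in project-equivalents:

```python
projects_per_year = 50
automatable_share = 0.20       # share of project effort that is reusable data preparation
automation_cost_factor = 0.5   # building the automation costs half of doing the work manually

# Total data preparation effort that could be avoided per year.
manual_effort = projects_per_year * automatable_share          # 10 project-equivalents

# Simplified first year: the reusable tasks are automated once, at half
# the manual effort, instead of being repeated manually in every project.
first_year_cost = manual_effort * automation_cost_factor       # 5 project-equivalents
first_year_saving = manual_effort - first_year_cost            # 5 project-equivalents

# Steady state: the automation already exists, so the full share is saved.
steady_state_saving = manual_effort                            # 10 project-equivalents

print(first_year_saving, steady_state_saving)  # 5.0 10.0
```

Even under this deliberately crude model, the automation pays for itself within the first year and the full 20% saving follows afterwards.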

Rule 9: Embrace an open source, cloud and AI first strategy

In the data and analytics space, a lot of the great tools are open source. The good thing about open source is that you can use it across the entire organization, and open-source tools are often more cloud-ready than many commercial software tools. Students coming from university typically know the tools, and there is plenty of training available. Open-source tools interoperate much better with other tools, and you can replace them much faster once they are out of date. All of this speaks for an open-source-first strategy. Especially in data science, data scientists need fast access to tools and data with flexible computing power. Hence, adopting a cloud-first approach is the best way forward for most companies, ideally as a hybrid cloud (private and public). AI assistants like Siri, Alexa and Cortana are currently reshaping the way we interact with machines, using natural language and automating business processes and decisions in the background. Building new applications should therefore follow the AI-first paradigm, no matter whether it is about internal process optimization (e.g. IT helpdesk) or customer-facing applications (e.g. customer support).

Rule 10: Balance data analytics innovation and transformation

Last but not least, combine some elements of each “Data Innovation” and “Data Transformation” in every project you do, right from the very beginning. This will keep people happy while buying you time to do the long-term stuff. It might sound simple, but it is effective. The trick is to find the right mix, e.g., when you run an analytics pilot, also work on the data quality or data collection in parallel.

The fascinating aspect to me is that digital transformation is one of the most difficult tasks anyone can take on. Nobody likes change. You receive a lot of resistance, and people try to politically kill you. Still, most people I know working in this field are passionate and love their job. They feel that they are doing something with a higher meaning. Their jobs can have significant, visible impact in the most positive sense after a few years. You don’t necessarily get the recognition for it. But you can see the result after a while and be proud.

Smart Machine Marketing and the Algorithmic Economy

The reason why Smart Machines are so much more powerful than conventional computer programs lies in the advanced AI algorithms and the data that they can absorb. Smart Machines can sense their own state and their environment, can communicate with other Smart Machines, are self-learning, can solve very complex problems, and can act, sometimes autonomously. There are many technologies behind the capabilities of Smart Machines. The most important enabler is the massive amount of computing power and storage available today at a relatively low price, which finally makes it possible to apply computationally heavy artificial intelligence algorithms that would not have been feasible a few years ago.

Many characteristics distinguish traditional software applications from Smart Machines. Computers have always been pretty good at repetitive, clearly described tasks and at applying strict logic and complex mathematics. An abundance of tasks today are solved by computers much faster, more cheaply and more reliably than by humans. Yet, in many ways, computers often appear annoyingly stupid. Have you ever tried to have a meaningful and interesting conversation with a computer? It can be a difficult and typically very frustrating endeavour. What computers are missing is the ability to understand the meaning of what we have to say, because language is very ambiguous. The very same sentence can mean something completely opposite if said in another situation or by a different person. “I love this computer” could mean that I really like my computer a lot, but it could also mean that I really hate my computer because it doesn’t do what I want it to do. It is very unlikely that “love” refers to romantic love in this context. The idea that computers can think like a human sounds far-fetched; it is, however, closer than you might think.

This is changing with the rise of Smart Machines. They are able to handle situations with ambiguity, sparse information and uncertainty, and thus to solve human kinds of problems. Instead of calculating the optimal solution using a predefined algorithm, Smart Machines evaluate different options and choose the best one out of the possibilities. Problems do not need to be provided in a specified machine-readable format; they can simply be formulated in natural language or even normal speech. Looking at the context of the problem makes it possible to interpret the question correctly. When I ask a smart machine “what is the best restaurant?”, it should understand that I am probably looking for a good restaurant that is not too far from my current location. Based on the outcomes of an action, Smart Machines can learn and improve their problem solving. Instead of being programmed, they can read PDF documentation to understand a business process, and observe how humans perform it, to build their own knowledge base and eventually handle the business process on their own.

The key components of a Smart Machine are depicted in the figure and will be explained in detail in the following. An incentive and rule system needs to be set for a Smart Machine which provides a purpose for the Smart Machine to exist (e.g. as a self-driving car) and the rules it needs to obey (e.g. ethics, law, company procedures, business goals).

[Figure: Smart Machine Marketing artwork]

In order for machines to see, feel, hear, smell, and taste like human beings, all aspects of the physical world need to be translated into “digestible” data for machines to process, reason, and act. The rise of low-cost sensor technologies and the Internet of Things with its connected devices enables the collection of data from the physical world without human interaction. All senses are needed to cover an entire customer journey from inspiration to usage. The augmented senses of machines allow a broader, deeper, and more personalized customer experience. Sensed information is fed, interpreted, filtered, interlinked and used to initiate further activities.

The most important ability of Smart Machines is to process the sensed information in a way similar to how we humans process information (i.e. empirical learning). Smart Machines are able to think and solve problems by understanding and clarifying objectives (and sometimes coming up with their own), by generating and evaluating hypotheses, and by providing answers and solutions like a human would (and unlike a search engine, which gives a list of results). Smart Machines are self-learning: they can adapt their own algorithms through observing, discovery and doing.

Finally, Smart Machines can act, by visualizing and providing the responses to a human decision maker, by informing or even commanding a human to execute certain activities, or in the extreme case, by completely autonomously executing a business process or any other actions. Based on the results of the actions, Smart Machines are able to re-calibrate their goal setting.

The impact of Smart Machines will be observable in three domains for marketing professionals. First, customers will get a more contextualized and personalized experience. Secondly, marketing departments will be able to do more with fewer people, building on the automation and scale of intelligent algorithms that take over some of the human labor. Thirdly, disruptive advances in the customer journey will become possible.

The marketing profession will be impacted quickly and significantly by Smart Machines and the Algorithmic Economy. Personalizing and contextualizing the customer experience is everyone’s aim, but creating meaningful, continuous 1:1 interactions at large scale, with thousands or millions of customers, is only feasible if Smart Machines take over a lot of the work. This means that Smart Machines take over work reserved for humans in the past, such as generating new content or supervising staff in retail stores to ensure high customer engagement. It also means that companies that still struggle with data-driven marketing will be in deep trouble. Those who embrace Smart Machines will be able to drive productivity beyond the imaginable for marketing and sales within the next decades.

Like all things in life, Smart Machines are a matter of perspective. For marketing divisions in traditional companies, they might be seen as the biggest threat in history. The way most marketing departments work today relies heavily on human labor and decision making. Shifting the work to Smart Machines will make many of the abilities of traditional marketing personnel redundant and will require new capabilities that the workforce does not necessarily have. For others, such as Silicon Valley startups, Smart Machines present a once-in-a-lifetime opportunity. Smart Machines enable them to scale their limited resources and thus challenge even the largest established players in their own strongholds, irrespective of whether it is retail, consumer goods, banking, insurance, manufacturing, entertainment or any other industry that requires Smart Machine Marketing.

Industrializing Data Science and Analytics

Gartner’s “Hype Cycle for Advanced Analytics and Data Science 2015” has just been published. The trends indicated in the hype cycle show a rising maturity of this young organizational discipline. It is interesting to see that the buzzword “Big Data” has finally disappeared from the hype cycle, while machine learning (a discipline that has been around for decades, at least in academia) has reached the peak of inflated expectations. This underpins a tendency to move from big data (the bigger the better) to smart data (the smarter the better). Simply put: “No matter if it is big or small data, it is still data and we aim to get more value out of it.”

A trend that is also visible at a second glance is the emerging industrialization of data science, which is underpinned by a number of developments. Vendors increasingly support the management of analytical models built by data scientists over their entire life cycle, as they are scaled from prototype to company-wide adoption. So far, the management of analytical models has been rather disorganized in most companies. Data scientists would create new models on a use-case-by-use-case basis. Some of the models actually did what they promised to do and would be deployed in operations.

End-to-end management of the models and reuse of solution patterns for analytical models across the enterprise has not been actively enforced or governed. In a new project, teams would often start nearly from scratch, although a similar model might already have been developed in a different business unit. From an organizational point of view, it makes sense to have a centralized data science unit that can support data scientists in decentralized business units. A central data science unit can ensure that learnings are incorporated and fed back to the organization and that analytical models are consistently governed even after they are handed over to IT.

Very connected to this is the concept of the model factory. The idea is to bring automation and scalability to the process of building and deploying predictive models. To find the best models, a huge number of models are built and tested using software tools that provide a high degree of automation during development. At the end of the process, only the best few models are deployed.
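A model factory can be sketched in miniature: fit a set of candidate models, score each on held-out data, and keep only the winner. The toy candidates below (a mean predictor and a closed-form least-squares line) stand in for a real model zoo, and the data is invented for the example:

```python
def fit_mean(xs, ys):
    """Baseline candidate: always predict the training mean."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_linear(xs, ys):
    """Ordinary least squares for y = a*x + b, computed in closed form."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return lambda x: a * x + b

def mse(model, xs, ys):
    """Mean squared error of a fitted model on a data set."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

CANDIDATES = {"mean": fit_mean, "linear": fit_linear}

def model_factory(train, holdout):
    """Fit every candidate, score it on held-out data, return the best."""
    xs, ys = train
    hx, hy = holdout
    scored = {name: mse(fit(xs, ys), hx, hy) for name, fit in CANDIDATES.items()}
    best = min(scored, key=scored.get)
    return best, scored

# toy data with a clear linear trend
train = ([1, 2, 3, 4], [2.1, 3.9, 6.2, 8.1])
holdout = ([5, 6], [10.0, 12.1])
best, scores = model_factory(train, holdout)
print(best)  # linear
```

Real model factories automate the same loop over far larger candidate spaces, but the governance point is identical: only models that survive the held-out comparison are promoted to deployment.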

Finally, a thrilling concept comes from Gartner’s Alexander Linden: the analytics marketplace. Some companies such as Microsoft, Rapidminer and FICO have created marketplaces where data science services and additional functionality are provided by third parties and can be purchased by users of the analytics platforms. This can become a true game changer. Similar to third-party app marketplaces, analytics marketplaces could become a source of millions of very domain-specific analytics micro-applications that drive innovation.

Today, we stand only at the beginning. I am convinced that in a few years’ time, data science and advanced analytics will be as industrialized as traditional IT. What has changed with the rise of data science is the speed with which new applications are developed and deployed, the increased willingness to experiment, and the direct way data innovates business models and business operations. Now, we only need to scale it to the rest of the enterprise to reap the full benefits.



Data Quality Expert Panel at DGIQ in San Diego

I really enjoyed the Data Governance and Information Quality conference that took place in mid-June in San Diego. There were many great talks; a highlight was Anthony Algmin, who talked about his first 100 days as the new Chief Data Officer at the Chicago Transit Authority. A great keynote was given by Scott Hallworth about the data quality journey at Capital One. Nancy Fessatidis of SAP gave a keynote on an emerging topic that gets a lot of attention these days: the ethics and morality of big data. The panel on controversial issues in data governance was a great ending to the conference.

During the conference, I gave a half day tutorial on “setting up a data quality risk management program at your organization”, where I could enjoy a very active and interested audience. Going to a lot of data conferences, I can observe a rising level of interest over the years in applying the risk paradigm to data quality, especially from regulated industries like banking and insurance.

I also participated in a very interesting panel discussion on data quality best practices with Michael Scofield, Peter Aiken, David Loshin and John Talburt, in which I highlighted the role of business-outcome-focused data quality metrics. You can watch the video of the panel discussion below.

Big Data – Big Risk: Why Companies Need Total Information Risk Management

In our book “Total Information Risk Management: Maximizing the Value of Data and Information Assets”, we argue that data has become the major source of risk in most industries. Never before in human history could data create so many opportunities and do so much harm to an organization’s success as today. Data has quickly penetrated every corner of our society. We use data in higher volumes, of higher variety and velocity, and from many new sources like social media and embedded sensors, in real time, to drive a majority of our decisions. It has become the most important asset of the 21st century, sometimes referred to as the “new oil”. Yet the rising importance of data and information assets also makes them the major source of risk for most companies. Oil can catch fire. When companies finally start to understand the dangers sleeping in poor data and information assets, often accumulated over decades and combined with new data from a variety of untrustworthy sources, it is often already too late: massive mis-investments, huge regulatory fines and permanent brand damage are only some of the consequences that cannot easily be undone once they happen.

Total Information Risk Management is a step-by-step guide for managers to identify and quantify the business impact of poor data on business process performance and organizational success, and to understand how such risks can be mitigated. Solid measurement and quantification of data and information risk enables companies to generate real accountability and to treat data and information assets seriously and more responsibly. It also provides a great basis for building a convincing business case for data quality improvement.

A very typical situation is, for example, a manager who asks: “How many new sellers do I need to hire to meet my targets?” The business analysts come back after a while with the precise answer: “Our analysis reveals that 3520 new sellers are needed”, which leads the decision makers to reply: “Ok, this is interesting, well done, 3520 sounds very reasonable. But how reliable is the data?” The business analysts assure them that the analysis was rigorously conducted using data from a system that most departments consider a trusted source. The leadership team, fully satisfied, announces the new targets to the rest of the organization: “We need to recruit 3520 new sellers in the next quarter. This is grounded in a rigorous analysis by our business analysts!”

But what if these numbers are wrong? Who would have time to verify whether the methodology and the data used to calculate the results are indeed trustworthy and of high quality? And who would dare to question such “hard” facts? If something goes wrong, management can always refer to the business analysts. And business analysts can easily blame the data behind the analysis, the general complexity of the problem, and other external factors that influence the outcome of the decision.

Literally millions of the most important decisions made by companies are executed exactly this way, every day. And an incredible number of these decisions are misled by poor data and sub-optimal analysis, leading to immense costs and risks in these organizations. There is a general lack of accountability, and this is why huge risks are created in companies day by day, and why nobody addresses the true root causes of these problems. The formula is simple: bad data leads to bad analysis, which leads to bad decisions, which leads to risks in operations and strategy.

So, how can risks from poor data be prevented? Companies can only protect themselves and make data and information reliable assets if they start measuring the risks created by not having the right data and information of sufficiently high quality. Assessing the risk caused by poor data and information assets makes the potential data and information risks tangible and visible to anyone, impossible to be ignored by the business part of the organization. Risk mitigation can then address the causes of the data and information risks with a targeted mix of technologies, transformation of the business environment and suitable information governance.
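The classic way to make such risks tangible is an expected-loss calculation over a data quality risk register: the probability of an incident times its business impact, summed over all known issues. The register below is purely illustrative, with invented probabilities and impacts:

```python
# Hypothetical data quality risk register: yearly probability that poor data
# causes an incident, and the estimated business impact if it does.
risks = [
    {"name": "wrong customer addresses",  "probability": 0.30, "impact_eur": 500_000},
    {"name": "duplicate contract records", "probability": 0.10, "impact_eur": 2_000_000},
    {"name": "stale pricing data",         "probability": 0.05, "impact_eur": 4_000_000},
]

def expected_annual_loss(risks):
    """Classic risk quantification: sum of probability x impact."""
    return sum(r["probability"] * r["impact_eur"] for r in risks)

total = expected_annual_loss(risks)
print(f"{total:,.0f} EUR")  # 550,000 EUR expected annual loss
```

A figure like this, however rough, turns an abstract data quality complaint into a number the business side can weigh against the cost of mitigation.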

Leading companies are not the ones that simply use data to drive decision making, but those companies that assure that the risks hidden behind the data are clearly understood, measured and managed pro-actively.