Empowering your data experts

What does it take to be a data scientist in 2022?

The breakneck pace of technological change means data scientists are contending with shifting expectations, technologies, tools and strategies

There may be no such thing as a typical data scientist. As expectations of analytics have grown and the quantity of available data has increased, the data scientist’s role has become diverse, exciting and in demand.

For a long time the role of ‘data scientist’ lacked definition. It was a buzzy label that data-savvy people could give themselves on LinkedIn with little push-back. For employers, it became a catch-all term for ideal candidates of impossibly numerous talents, with expectations set for ultra-technical statisticians, strategic leaders and operations specialists all at once. “We have to blow up this idea of the data scientist unicorn,” says Dr Usama Fayyad, executive director of the Institute for Experiential AI at Northeastern University. Instead, candidates and employers should set realistic expectations, with this all-consuming ideal sliced into its component parts. Some people may be better at technically complex tasks like probabilistic modelling; others may be better at data translation – communicating results and fitting insights into a strategic, business setting – a role that may not be the best use of highly technical engineering or science skills.

We have to blow up this idea of the data scientist unicorn

There is a move to reset these expectations, though, a shift that is perhaps inevitable given the breakneck pace of technological change.

When data science was a smaller field than it is today – there were 250,000 job vacancies in 2020, according to some reports – the expectation was that data scientists had to know everything about all types of models and how they worked.

Now, there’s a broader recognition that the field just moves too quickly for that to be realistic, and it’s more commonly acknowledged that specialisations, such as NLP or forecasting, can be extremely useful, says Michael Shores, senior director of data science at Vista.

“This also applies to the languages data scientists work with – a few years ago, data scientists used to be polyglots, knowing R, SAS, Python, and possibly a few others,” Shores explains. 

“But the field has coalesced around Python. This poses an interesting labour problem for highly regulated industries where ‘old’ languages like SAS are still being used, but there’s scant new talent that already knows, or is willing to learn, these ‘dated’ languages.”

While there may be a convergence in the common languages used by data scientists, the opposite is true for the types of data they analyse: the reams of unstructured data in all their ‘modalities’ – whether audio, images, tables or text – and the new ways of interpreting them opened up by deep learning technologies.

“Deep learning is here to stay and it’s gone beyond simple optical character recognition,” says Edgar Meij, head of AI discovery at Bloomberg. “It’s become the multi-tool pocket knife for machine learning, so understanding the possibilities – and what the capabilities are not – as a data scientist is critically important.”

Additionally, says Meij, while there have been enormous advances in automated machine learning (AutoML) and automatic hyperparameter tuning, keeping humans in the loop will always be essential – for interpreting and making sense of everything from inputs to analysis.

The same principles apply to annotation, he adds: “It’s one thing as a data scientist to look at some data, query it, transform it, analyse it, model it and present the results. It’s another to get those results validated. So if you have a model that makes predictions – look at some of those predictions, and work towards getting a continuous annotation feedback loop going, allowing data scientists to lean on human judgements.”
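As a minimal sketch of what such a feedback loop might look like, assuming a scikit-learn-style classifier and synthetic data (the article names no tooling): the model’s least confident predictions are routed to a human reviewer, and the validated labels are folded back into training.

```python
# A toy continuous annotation feedback loop: sample the model's shakiest
# predictions, send them for human review, retrain on the validated labels.
# The data, reviewer function and library choice are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))            # stand-in feature matrix
y_train = (X_train[:, 0] > 0).astype(int)      # stand-in labels

model = LogisticRegression().fit(X_train, y_train)

def human_review(x_row, predicted_label):
    """Placeholder for a real annotation tool; here we simulate a reviewer."""
    return int(x_row[0] > 0)                   # the 'true' label in this toy setup

for _ in range(3):                             # three rounds of the feedback loop
    X_new = rng.normal(size=(50, 5))           # fresh, unlabelled data
    proba = model.predict_proba(X_new)[:, 1]
    uncertain = np.argsort(np.abs(proba - 0.5))[:10]  # least confident predictions

    # Human judgements on the least confident calls become new training data
    reviewed = np.array([human_review(X_new[i], proba[i] > 0.5) for i in uncertain])
    X_train = np.vstack([X_train, X_new[uncertain]])
    y_train = np.concatenate([y_train, reviewed])
    model = LogisticRegression().fit(X_train, y_train)
```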

Keeping abreast of the latest developments on the horizon of data science is one way that data practitioners can help to safeguard their skills for the rapidly changing future.

But it’s hardly just technical change on the horizon. Data scientists must grapple with responsible AI and the concepts of fairness, accountability, transparency, ethics and sustainability – FATES, for short – as these technologies become more embedded in our daily lives.

“Once seen as fringe, more people are paying attention to these aspects, especially when data science is used for more automated decision-making that involves people,” comments Professor Paul Clough at the University of Sheffield. “Linked to this is the ethical use of data, ensuring people are aware of how their data is used and have provided appropriate consent. Related to this are topics like explainable AI, seeking to make clear how algorithms, especially neural networks, arrive at an outcome.”

This is just one reason why it’s so vital that data scientists have a seat at the leadership table. 

While most businesses will be cognisant of the transformational power of data, they will need people who can translate this most technical of subjects into the language of business, inform decision-making, and embed transparency and ethics into day-to-day operations.

After all, says Meij, quoting a motto frequently mentioned by his boss, Michael Bloomberg: “If you can’t measure it, you can’t manage it.”

How machine learning can give you an edge

With organisations often drowning in a vast amount of unstructured or semi-structured data, data scientists are turning to AI and machine learning to help them manage and make sense of it all

In a ghost broking scam, fraudsters buy legitimate insurance policies using false or stolen details before selling them on to unwitting consumers. It’s one of the hardest sorts of fraud for insurance companies to spot. For insurance company Covéa, which provides more than 2 million quotes per day, it was resource-intensive to monitor. 

Thanks to machine learning (ML), Covéa can now scan insurance policies automatically. A series of ML models scans policies 24 hours a day, looking at millions of individual pieces of data, and can predict with a high degree of accuracy whether a policy is fraudulent. “This used to be done by the financial crime team, who would scan policies and base a decision on ‘gut feel’,” says Tom Clay, Covéa’s chief data scientist. “We built a model that replicates how their gut feel worked but can scan more data, more quickly. If the ML flags a policy, it is passed along to the financial crime team to be checked manually.”
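As a rough illustration of that triage flow, here is a minimal sketch in which the features, weights and threshold are invented for the example, not Covéa’s actual system: every policy is scored, and only the riskiest are queued for a manual check.

```python
# A toy fraud-triage flow: score every policy, queue only the riskiest for
# manual review by the financial crime team. All signals and the threshold
# are hypothetical.

def fraud_score(policy):
    """Stand-in for a trained model's predicted fraud probability."""
    score = 0.0
    if policy["payment_card_reused"]:
        score += 0.5                       # same card seen across many policies
    if policy["quote_to_purchase_secs"] < 30:
        score += 0.3                       # suspiciously fast purchase
    if policy["address_mismatch"]:
        score += 0.2
    return min(score, 1.0)

REVIEW_THRESHOLD = 0.7                     # only flag the riskiest policies

def triage(policies):
    """Split policies into a manual-review queue and an auto-cleared list."""
    manual_review, auto_clear = [], []
    for policy in policies:
        (manual_review if fraud_score(policy) >= REVIEW_THRESHOLD
         else auto_clear).append(policy)
    return manual_review, auto_clear

# Example: one risky policy, one ordinary one
policies = [
    {"payment_card_reused": True, "quote_to_purchase_secs": 12, "address_mismatch": False},
    {"payment_card_reused": False, "quote_to_purchase_secs": 600, "address_mismatch": False},
]
flagged, cleared = triage(policies)
print(len(flagged), "policy flagged for the financial crime team")
```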

The human brain can only weigh up a few variables in a decision-making process. Machine learning, though, can include hundreds or thousands of variables and do so much faster than a human. With the quantity of data doubling every 18 months, machine learning could help businesses to solve a host of problems.

A simple example of how this works can be seen in traffic apps that tell you the best way home. The software considers variables that include the weather, historical traffic reports, the time of day, roadworks – and then selects the optimal route.  
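As a toy sketch of that idea (not any real app’s algorithm, and with made-up weights), each candidate route is scored on several variables and the one with the lowest predicted travel time wins:

```python
# A toy multi-variable route selector. The weights are invented for
# illustration; a real system would learn them from historical data.
def predict_minutes(route):
    return (route["base_minutes"]
            + 8.0 * route["rain_intensity"]        # weather slows traffic
            + 0.5 * route["historic_congestion"]   # 0-100 congestion index
            + 15.0 * route["roadworks"])           # 1 if roadworks present

routes = [
    {"name": "motorway", "base_minutes": 30, "rain_intensity": 0.6,
     "historic_congestion": 70, "roadworks": 0},
    {"name": "back roads", "base_minutes": 42, "rain_intensity": 0.6,
     "historic_congestion": 20, "roadworks": 0},
]
best = min(routes, key=predict_minutes)
print(best["name"], round(predict_minutes(best), 1), "min")
```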

In financial services, machine learning is widely used to evaluate fraud risk. At Covéa, the technology has substantially increased fraud detection rates and Clay says the project delivered 11 times the ROI in less than a year. And the benefits aren’t only financial. “ML allows us to work our customer data, which is unique to us and a real USP. That helps us to understand customers better, target products better and achieve a real competitive edge,” he says. 

With organisations often drowning in unstructured or semi-structured data, AI and machine learning offer a way to derive a competitive advantage from that data, says Simon Case, head of data service with IT services firm Equal Experts. “Organisations have a wealth of data but in many cases, it isn’t being used at all. But within that data might be information that helps logistics firms send products in more efficient ways, or financial organisations to reduce risk, or to decide whether someone can afford a loan,” he says. 

Organisations have a wealth of data but in many cases, it isn’t being used at all

The idea of building a predictive model based on historical data sounds simple. The truth is that relatively few organisations have a single data set that can act as ‘ground truth data’, says Case. Organisations have data in multiple formats, silos and locations, making it hard to get a full picture. Added to that, deploying, scaling and managing models across an enterprise can be a time-consuming and costly process. 

When Covéa launched its AI programme, just 2% of its customer data was consistently labelled, says Clay. He estimates it took almost two years to get all the data properly labelled and to train ML models. “We had to ask the business to bear with us when it came to developing the ML models. We used a cloud-based MLaaS service to help us develop, test and deploy models consistently, which delivered an 11% ROI within six months,” he says. 

The good news is that the emergence of off-the-shelf ML applications and ML as a Service (MLaaS) platforms makes it easier to get started with machine learning. Large solution providers can help guide the development of ML models and handle the heavy lifting of data analysis, reducing pressure on in-house IT resources.

MLaaS platforms are an emerging set of services that help enterprises get started with ML faster. They provide processing power for AI tasks and help development teams build, test and deploy models at scale. According to a report from Transparency Market Research, spending on MLaaS is expected to grow to almost $20bn (£16.7bn) by 2025.

That doesn’t surprise Case, who sees an explosion in demand for enterprise AI services over the next five years. “Implementing AI solutions means businesses can understand customers better, which means they can develop products and target them more effectively. They can automate boring or routine tasks, which saves money and frees up people for more important jobs. And in many cases, it lets organisations do things that just weren’t possible before,” he says.

Commercial feature

Delivering AI: the right tech at the right time

Machine learning helps organisations extract value from data but not without cost. Creating an efficient path to value for AI requires agility in where, and how, enterprise data science and machine learning teams work

Artificial intelligence (AI) isn’t magic. It is built with hard work, brains and a lot of code, using techniques ranging from classical machine learning and statistics to the latest NLP models. From skilled labour and supporting software to powerful hardware, it’s a full-scale endeavour with its own operations and economics.

Machine learning models are the heart of AI, and to make accurate predictions they must be trained and optimised over many iterations, typically with vast amounts of data. These iterations aren’t just algorithmic; they are often conceptual, as teams tweak their approach to solving a problem. More iterations lead to higher model quality, which directly improves both revenue and margins, especially in business applications such as recommender systems.

So, many companies are asking: “How can we get more iterations in less time and at the right cost?”

Optimising your iterations is only achievable if you have the right technology, says Jeri Culp, director of data science with HP. “When you’re talking about the vast quantities of data used in machine learning, [iterations] may be limited by cost if you’re operating in a cloud environment, or by the limitations of your hardware, applications and skills when operating in-house,” she says. Regardless of where you work, cloud or on-premises, the underlying hardware is a major determinant in how quickly your teams can iterate. 

Cloud environments offer flexibility and scale to help maximise iterations, but they also present challenges for teams with constraints on data location or tight operational budgets. It all depends on the context. For example, size or regulatory issues may require the data to reside on-premises, ruling out the cloud. Or, maybe your division is working to minimise operational expenses in a given year, meaning you can’t have big monthly bills. 

In such cases, the option is to have teams work locally on your own hardware, whether in the data centre or at the edge on devices like laptops and workstations. It is then crucial to make sure your teams can still get the number of iterations they need, which goes right back to the technology you use. “The more iterations you can do the better the overall result, so there’s a huge benefit if you can run iterations more quickly, in-house,” says Scott McClellan, senior director of data science and MLOps (machine learning operations) at Nvidia.

Ultimately, what organisations need locally is more computing power. With a more powerful machine, data scientists can run more iterations on bigger data sets, delivering better results to the business. “The right workstation is the workhorse of the machine learning process. You need to ensure you have the right hardware with enough GPU, CPU and memory for your specific workflow,” says Culp. 

Compared to a standard PC, data science workstations offer huge performance gains that allow data scientists to boost their iterations, helping to deliver better models in less time. “Accelerating the timeline means you deliver benefit to the business more quickly, with more accurate models that are easier to fine tune,” Culp adds.

This kind of local power is especially important for tasks like transfer learning. Here, a data scientist takes a pre-existing model that may have cost millions of dollars to train and then adapts it to do some specific task by training it on a smaller amount of data.
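A minimal sketch of the idea, assuming PyTorch and torchvision (the article names no framework): the expensive pretrained backbone is frozen and reused as-is, and only a small new head is trained on the smaller, task-specific dataset.

```python
# Transfer learning sketch: reuse a pretrained backbone, train a new head.
import torch
import torch.nn as nn
from torchvision import models

# Load a model pretrained on ImageNet (the expensive, pre-existing model)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the backbone so the costly pretrained features are reused as-is
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer with a head for our specific task,
# e.g. 5 document categories (a hypothetical target)
num_classes = 5
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Only the new head's parameters are updated during fine-tuning
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch
inputs = torch.randn(8, 3, 224, 224)    # stand-in for a small labelled dataset
labels = torch.randint(0, num_classes, (8,))
optimizer.zero_grad()
loss = criterion(model(inputs), labels)
loss.backward()
optimizer.step()
```

Because only the small head is trained, each iteration is far cheaper than training the full model, which is what makes this workload feasible on local hardware.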

One challenge of using workstations is the time and skill needed to configure them and install the required data science and machine learning software. For example, a recent survey by HP found that 42% of data scientists say it takes too long to configure their environment. HP’s Data Science Workstations come pre-loaded with a full AI toolset, allowing your team to get to work more quickly while giving them access to the latest frameworks and optimised software from Nvidia. They also feature next-generation Nvidia professional GPUs that can withstand the heaviest usage and most demanding processing tasks.

We’re at the start of a very exciting journey, with huge opportunities to do anything from self-driving cars to automating routine manual processes

With a small investment, it’s never been easier for organisations to make the leap into machine learning and AI. “We’re at the start of a very exciting journey, with huge opportunities to do anything from self-driving cars to automating routine manual processes,” says Culp. “It’s about seizing the opportunity to transform data from being a cost to being a differentiator that makes you more competitive.”

Fuelling innovation at American Airlines

When American Airlines customers need to ship cargo, they usually book around 10 days in advance. Because payment is made on delivery, it’s not unusual for some booked cargo not to arrive at the warehouse on time. 

The challenge for American Airlines is how to maintain an efficient service while predicting the likely load on any given day. No-shows cost millions in lost revenue, and inefficient cargo loads burn more fuel.

The company relies on HP Data Science Workstations to run models that predict how likely it is that a cargo shipment will arrive, so they can plan shipments ahead of time. The workstations run a machine learning model that examines each customer order and identifies those least likely to arrive. The American Airlines team can then reach out to that customer to confirm whether or not they’ll make the scheduled flight.

The Z by HP Workstations use a GPU-accelerated machine learning package to analyse 500,000 booking records at 10x previous speeds. With the workstations, the airline can predict with more than 90% accuracy whether a shipment will arrive. They get predictions and results much quicker, leading to increased cargo space utilisation and reduced fuel burn.

What are some of the key challenges facing data teams?

Data is such an in-demand and valuable asset, but there's a way to go to equip data experts with what they need to make the most of it

What is slowing down data scientists and getting in the way of everyday work? A lack of computing power and resources holds data scientists back, and many also feel misunderstood by management.

With data science the most in-demand IT skill, supplying employees with the right tools is vital. [Chart: fastest-growing IT skills throughout 2021, % growth year on year]

And with the global volume of data rising, our need for data scientists is only set to grow. [Chart: volume of data forecast to be created, captured, copied and consumed worldwide from 2020 to 2025, in zettabytes]

How to develop the next generation of data scientists

How can businesses, educators and governments address the widening data science skills gap – and encourage young people into an increasingly demanding, complex role? Changing perceptions may be a good place to start

The data science skills gap is less a gap than a gaping chasm. In 2020 alone, there were 250,000 job vacancies in the sector, far outstripping the supply of candidates. Nearly half of British businesses were recruiting for roles that required data skills in 2021 and, with the explosion in data showing no signs of abating, demand is set to rise further: according to the US Bureau of Labor Statistics, the number of jobs in data science will grow by 27.9% by 2026.

Combined with a challenging hiring environment – talent recruitment is the top issue for almost half of all employers, even before specialised roles are considered – educators, businesses and governments need to think fast about how to help supply meet demand.

But when Harvard Business Review described data scientist as the “sexiest job of the 21st century” in 2012, it may have hindered rather than helped recruitment, suggests Ken Jee, head of data science at Scouts Consulting Group. The label led people to imagine a dream job in the top pay percentile, which did not set realistic expectations. “This has led to a lot of attrition and dissatisfaction in roles,” he explains. “In reality, data science is great for people who want a certain lifestyle and like to address specific types of problems. It is definitely not for everyone.”

Perhaps first on the agenda should be changing perceptions. Sexiest job or not, STEM students tend to view data science negatively, as hyper-competitive and ultimately lacking purpose. This is also contributing to a gender diversity gap: just 15% of all data science students in the UK were women. That matters all the more given the need for genuinely representative data sets to inform AI systems and guard against bias.

Such issues are acknowledged more today than they were in the past: witness the UK government’s recent policy paper on the data science skills gap, while organisations like the Alan Turing Institute run initiatives for under-represented groups. But there is a need to address the industry’s image problem and better communicate the value of data science and its many facets.

“The best way to nurture a passion for data science is by telling interesting stories about it,” adds Jee. “Computers are beating humans at chess, Go and StarCraft – awesome technology advancements like this have been made in the last few years but AI or machine learning are perceived as too complicated for a broad audience. Real stories about this progress could transform this spark into a brilliant flame.”

The best way to nurture a passion for data science is by telling interesting stories about it

Meanwhile, when students graduate, employers have come to expect data science unicorns to apply for advertised roles, holding out for an unrealistically perfect candidate. Data science is perhaps unique in the need to constantly upskill and reskill to keep pace with technical developments, which makes the field more nuanced than ever.

At the same time, soft skills are increasingly important, while ‘data translation’ and data-led strategy are also in high demand. “The capabilities that make great data scientists are not just technical,” says Richie Ramsden, technical director at the National Innovation Centre for Data. “Collaboration, communication and attitude are rapidly becoming as vital as technical skills.”

The next generation of data scientists may have to lean into certain specialisms, rather than trying to take the catch-all unicorn approach. Students should be encouraged to play to their own talents, whether ultra-technical or more focused on communications, ethics, responsible AI or strategy. Communicating increasingly complex data sets is more important than ever, says Professor Paul Clough at the University of Sheffield’s Information School.

“Data scientists need to develop ‘data translation’ skills, whereby they don’t just understand the technologies, methods and what is possible, but can translate the tangible benefits of data science and AI in a business context,” he says.

Data science is booming as a career

For Dr Usama Fayyad, executive director of the Institute for Experiential AI at Northeastern University, the current approach to data science education in universities does not focus enough on the practical. He believes data science should borrow from medicine, where students apply their knowledge in practical settings like hospitals under the supervision of experienced professionals. That way, students “have a feel for what it takes to do a project in real life – when half the data is missing, or a third is wrong, dealing all the way from what’s responsible and ethical AI, down to what’s realistic in terms of deployment.”

It’s a point echoed by recent graduate Hannah Alexander, who came to data science with a mechanical engineering degree and is now a junior data scientist at Ascent. She says there’s a difference between learning in principle and applying in practice. “For those who learn data science in an academic setting, there should be some emphasis on solving real-world problems,” she says. “The hardest part is the application.” 

But Fayyad isn’t worried about the data scientist skills gap in the long term. “This is the agriculture and harvesting of the future,” he says. “We will get people to enrol simply because the value is there.”