Data pipelines for financial services: choosing the right tool for the job

QuantSpark has extensive experience working with clients in financial services. Typically, their investment teams live in Excel – a manual and time-consuming practice out of step with modern tooling. Their data teams often aim to digitise these operations, creating a single platform where portfolio managers can access readily available, up-to-date data to guide their investment decisions. Achieving this dramatically increases decision-making speed and reduces the analyst time required to keep the original Excel sheets current.

Building this kind of platform usually involves working with multiple data pipelines carrying large volumes of data. Frequently used data sources in the industry include reports on the performance of individual companies, stock market data, and supplementary data such as ESG indicators.

Across these engagements there is typically an engineering requirement for our teams: testing which tools are best suited to building pipelines of different sizes, to ensure a steady flow of data to business stakeholders such as portfolio managers and investment analysts. Picking the right tool for the job is critical. Some pipelines carry far more data than others, for example, and require a heavyweight tool to manage the load. And the pipeline always serves business objectives: improving the data team’s capability and, ultimately, the investment team’s performance.

Why does this matter to a financial services business?

For a firm looking to stay ahead, digitising business processes and ways of working is essential. The world simply moves too quickly to rely on unwieldy Excel files, and the bigger the file, the longer it takes to maintain manually. A typical hedge fund might run hundreds of similar Excel-based pipelines – imagine the cumulative hours spent updating those files that could be used more productively. Equally, the more complex the logic within those spreadsheets, the more likely it is to go wrong, posing a material risk to investment decision-making.

Which pipeline tools work best?

From a shortlist of current tools, QuantSpark’s team selected two options to test: AWS Glue DataBrew, a no-code tool designed to clean and normalise data in preparation for analysis or machine learning models, and PySpark, the Python API for Apache Spark, which allows large volumes of data to be processed in a distributed fashion.

Each has pros and cons depending on the use case.

AWS Glue DataBrew

Pros:

  • Designed to make data visualisation simple and straightforward

  • Capable of easily ingesting data from Excel sources

Cons:

  • As a ‘no-code’ tool, it can make more complex functions and calculations harder to achieve.
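
That said, a DataBrew job built in the console can still be triggered programmatically, allowing it to slot into automated workflows. A minimal sketch using boto3 follows; the job name is hypothetical and assumes the recipe job has already been created:

    # Trigger a pre-built AWS Glue DataBrew recipe job from Python.
    # "clean-positions-job" is a hypothetical name; the job itself is
    # assumed to have been created beforehand in the DataBrew console.
    import boto3

    databrew = boto3.client("databrew")

    # Start a run of the job and keep the run ID for tracking.
    run = databrew.start_job_run(Name="clean-positions-job")
    print(run["RunId"])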

PySpark

Pros:

  • Capable of flexibly handling most functions, whether simple or complex.

  • An ideal tool for automating tasks, delivering solid productivity gains to business teams

Cons:

  • Since PySpark is a traditional coding tool, additional steps or modules are typically required to ingest data from Excel inputs.

  • This is usually an unexpected step for teams that are new to PySpark and one that requires extra research; a common workaround is sketched below.
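
One common approach is to read the workbook with pandas and hand the result to Spark. The minimal sketch below assumes pandas and openpyxl are installed alongside PySpark – exactly the kind of extra modules a plain Spark installation does not include – and the file and sheet names are hypothetical:

    # Read an Excel workbook into Spark via pandas.
    # Requires pandas and openpyxl in addition to PySpark.
    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("excel-ingest").getOrCreate()

    # pandas (using openpyxl under the hood) parses the .xlsx file...
    pdf = pd.read_excel("portfolio_positions.xlsx", sheet_name="Positions")

    # ...then Spark takes over for distributed processing.
    df = spark.createDataFrame(pdf)
    df.printSchema()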

QuantSpark’s recommendation

Our recommendation for most financial services use cases, especially teams building their first non-Excel pipelines, is PySpark. Its flexibility means it can be applied across a range of different pipelines and tasks, whatever their complexity. The main technical challenge is knowing that additional files and modules may be needed to set up the pipeline – a burden the first time around, but easily surmounted by future teams if the process is properly documented.

Once these extra modules are installed, creating a pipeline is simply a matter of writing a PySpark script and pointing the relevant data at the chosen storage location. The script opens up multiple automation possibilities, from calculations previously carried out in Excel (benefitting investment analysts) to scheduled data refreshes (benefitting the tech team).
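
As an illustration, a minimal sketch of such a script is below. The bucket paths, column names and the 20-day moving average (standing in for a calculation previously maintained in Excel) are all assumptions:

    # A minimal PySpark pipeline: ingest raw prices, replicate an
    # Excel-style calculation, and land the result in curated storage.
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("daily-prices-pipeline").getOrCreate()

    # Ingest raw market data (hypothetical path and schema).
    prices = spark.read.csv(
        "s3://example-bucket/raw/prices.csv", header=True, inferSchema=True
    )

    # Replicate an Excel calculation: a 20-day moving average of the
    # close price, computed per ticker over a rolling window.
    w = Window.partitionBy("ticker").orderBy("trade_date").rowsBetween(-19, 0)
    enriched = prices.withColumn("ma_20d", F.avg("close").over(w))

    # Write to the location analysts and BI tools read from.
    enriched.write.mode("overwrite").parquet("s3://example-bucket/curated/prices/")

A script like this can then be scheduled – for example as an AWS Glue job or an Airflow task – so the data refreshes without manual intervention.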

In summary, for data pipelines of all sizes serving financial services companies like these, PySpark proved to be an ideal tool.

To discuss how we use advanced analytics to drive value in financial services, contact us.

Get in touch

Are you looking for a team with deep expertise in advanced analytics and modelling techniques to drive value in your business? We can support you.
