Data pipelines for financial services: choosing the right tool for the job

QuantSpark has extensive experience working with clients in financial services. Typically, their investment teams live in Excel – a manual and time-consuming practice out of step with modern tooling. Their data teams often aim to digitise these operations, creating a single platform where portfolio managers can access readily available, up-to-date data to guide their investment decisions. Achieving this dramatically increases decision-making speed and reduces the analyst time required to keep the original Excel sheets current.

Building this kind of platform usually involves working with multiple data pipelines carrying large volumes of data. Frequently used data sources in the industry include reports on the performance of individual companies, stock market data, and supplementary data such as ESG indicators.

Across these engagements there is typically an engineering requirement for our teams: testing which tools are best suited to building pipelines of different sizes, to ensure a steady flow of data to business stakeholders such as portfolio managers and investment analysts. Picking the right tool for the job is critical. Some pipelines carry far more data than others, for example, and require a heavyweight tool to manage the load. And the pipeline always serves business objectives: improving the data team’s capability and, ultimately, the investment team’s performance.

Why does this matter to a financial services business?

For a firm looking to stay ahead, digitising business processes and ways of working is essential. The world simply moves too quickly to rely on unwieldy Excel files, and the bigger the file, the longer it takes to maintain manually. A typical hedge fund might run hundreds of similar Excel-based pipelines – imagine the cumulative hours spent updating those files that could be used more productively. Equally, the more complex the logic within those spreadsheets, the more likely it is to go wrong, posing a material risk to investment decision-making.

Which pipeline tools work best?

From a shortlist of current tools, QuantSpark’s team selected two options to test: AWS Glue DataBrew, a no-code tool designed to clean and normalise data in preparation for analysis or machine learning models, and PySpark, the Python API for Apache Spark, which allows large volumes of data to be processed in a distributed fashion.

Each has pros and cons depending on the use case.

AWS Glue DataBrew

Pros:

  • Designed to make data visualisation simple and straightforward

  • Capable of easily ingesting data from Excel sources

Cons:

  • As a ‘no-code’ tool, it can make more complex functions and calculations harder to achieve.
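
That said, a DataBrew job built in the console can still be triggered programmatically, allowing it to slot into automated workflows. A minimal sketch using boto3 follows; the job name is hypothetical and assumes the recipe job has already been created:

    # Trigger a pre-built AWS Glue DataBrew recipe job from Python.
    # "clean-positions-job" is a hypothetical name; the job itself is
    # assumed to have been created beforehand in the DataBrew console.
    import boto3

    databrew = boto3.client("databrew")

    # Start a run of the job and keep the run ID for tracking.
    run = databrew.start_job_run(Name="clean-positions-job")
    print(run["RunId"])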

PySpark

Pros:

  • Capable of flexibly handling most functions, whether simple or complex.

  • An ideal tool for automating tasks, delivering solid productivity gains to business teams

Cons:

  • Since PySpark is a traditional coding tool, additional steps or modules are typically required to ingest data from Excel inputs.

  • This is usually an unexpected step for teams that are new to PySpark and one that requires extra research; a common workaround is sketched below.
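
One common approach is to read the workbook with pandas and hand the result to Spark. The minimal sketch below assumes pandas and openpyxl are installed alongside PySpark – exactly the kind of extra modules a plain Spark installation does not include – and the file and sheet names are hypothetical:

    # Read an Excel workbook into Spark via pandas.
    # Requires pandas and openpyxl in addition to PySpark.
    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("excel-ingest").getOrCreate()

    # pandas (using openpyxl under the hood) parses the .xlsx file...
    pdf = pd.read_excel("portfolio_positions.xlsx", sheet_name="Positions")

    # ...then Spark takes over for distributed processing.
    df = spark.createDataFrame(pdf)
    df.printSchema()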

QuantSpark’s recommendation

Our recommendation for most financial services use cases, especially teams building their first non-Excel pipelines, is PySpark. Its flexibility means it can be applied across a range of different pipelines and tasks, whatever their complexity. The main technical challenge is knowing that additional files and modules may be needed to set up the pipeline – a burden the first time around, but easily surmounted by future teams if the process is properly documented.

Once these extra modules are installed, creating a pipeline is simply a matter of writing a PySpark script and pointing the relevant data at the chosen storage location. The script opens up multiple automation possibilities, from calculations previously carried out in Excel (benefitting investment analysts) to scheduled data refreshes (benefitting the tech team).
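
As an illustration, a minimal sketch of such a script is below. The bucket paths, column names and the 20-day moving average (standing in for a calculation previously maintained in Excel) are all assumptions:

    # A minimal PySpark pipeline: ingest raw prices, replicate an
    # Excel-style calculation, and land the result in curated storage.
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("daily-prices-pipeline").getOrCreate()

    # Ingest raw market data (hypothetical path and schema).
    prices = spark.read.csv(
        "s3://example-bucket/raw/prices.csv", header=True, inferSchema=True
    )

    # Replicate an Excel calculation: a 20-day moving average of the
    # close price, computed per ticker over a rolling window.
    w = Window.partitionBy("ticker").orderBy("trade_date").rowsBetween(-19, 0)
    enriched = prices.withColumn("ma_20d", F.avg("close").over(w))

    # Write to the location analysts and BI tools read from.
    enriched.write.mode("overwrite").parquet("s3://example-bucket/curated/prices/")

A script like this can then be scheduled – for example as an AWS Glue job or an Airflow task – so the data refreshes without manual intervention.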

In summary, for data pipelines of all sizes serving financial services companies like these, PySpark proved to be an ideal tool.

To discuss how we use advanced analytics to drive value in financial services, contact us.

Get in touch

Are you looking for a team with deep expertise in advanced analytics and modelling techniques to drive value in your business? We can support you.
