Analysis of trade-offs between performance and energy efficiency of scalable dataframes tools

Please use this identifier to cite or link to this record: https://hdl.handle.net/1822/92591

Title:	Analysis of trade-offs between performance and energy efficiency of scalable dataframes tools
Author(s):	Martins, André Carvalho da Cunha
Advisor(s):	Vilaça, Ricardo Manuel Pereira
Keywords:	Dataframe; Distributed and parallel computing; Performance; Energy consumption; Benchmark
Date:	27-Nov-2023
Abstract(s):	Nowadays, we have the ability to trace everything and to extract valuable data from wherever we want, all to keep us connected and to improve our lifestyle. This huge amount of information, produced every day, needs to be treated, manipulated, and analysed, which requires data structures capable of doing so. Dataframes, regularly used worldwide, are powerful data structures used to analyse and manipulate data of any kind. A Dataframe organizes data into a 2-dimensional table of rows and columns, similar to SQL tables or CSV files. Furthermore, it can span thousands of computers or servers, making it easier to work with huge amounts of data, called big data, using distributed systems and parallel computing. This distributed nature of Dataframes led to the rise of distinct scalable and parallel Dataframe tools. The most used Dataframe tool, pandas, executes only sequentially and has some limitations when there is a need to handle huge volumes of data, and tools such as Modin, Polars, and RAPIDS, among others, appeared in order to overcome those limitations. The vast offer of these scalable tools brought the need to analyse and compare these frameworks against pandas, studying their behaviour and results with different workflows. This comparison is not straightforward, and a benchmarking tool is needed to produce a homogeneous and reliable evaluation of the different frameworks. To perform this analysis, we worked with several workflows, manipulating real and synthetically produced data in distributed and parallel environments and on different hardware configurations.
We designed and developed a benchmarking tool that supports a set of Dataframe frameworks, is flexible to the addition of new frameworks, and is able to perform micro-benchmarking, which analyses a group of individual operations commonly used in data science, and macro-benchmarking, which analyses workflows that represent a set of chained operations. Both of these evaluations aggregate performance and energy consumption results for each framework.
Nowadays, it is possible to extract data from wherever we want, keeping us all connected and contributing to a better lifestyle. This enormous amount of information, produced daily, needs to be treated, manipulated, and analysed, which requires data structures capable of doing so. Dataframes, used globally, are a powerful data structure capable of analysing and manipulating any kind of data. A Dataframe is organized as a 2-dimensional table of columns and rows, like an SQL table or a CSV file. A Dataframe can be split across multiple servers, which makes it easier to work with enormous amounts of data, since parallel computing can be used. This parallel nature of Dataframes led to the appearance of several scalable and distributed tools. The most used Dataframe tool, pandas, can only execute sequentially and has some limitations when there is a need to work with enormous amounts of data, and tools such as Modin, Polars, and RAPIDS, among others, appeared to overcome those limitations. The vast offer of these scalable tools brought the need to analyse and compare them against pandas, studying their behaviour and results with different workflows. This comparison is not straightforward, and a benchmarking tool is needed to produce a homogeneous and reliable evaluation.
To perform this analysis, we worked with workflows of several types, manipulating real and synthetically produced data in distributed environments and on different hardware configurations. We prototyped and developed a benchmarking tool that supports several distributed tools, is flexible to the addition of new tools, and is able to perform micro-benchmarking, with the analysis of individual operations, and macro-benchmarking, with the analysis of workflows that represent a set of chained operations. Both evaluations aggregate results on the performance and energy consumption of each framework.
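The distinction between micro-benchmarking (one operation at a time) and macro-benchmarking (a chained workflow) described in the abstract can be illustrated with a minimal Python sketch. This is not the dissertation's tool: the names `benchmark`, `chain`, and `energy_reader` are hypothetical, the "table" is a plain list of dictionaries rather than a real Dataframe, and energy is only recorded if the caller supplies a function returning a cumulative joule counter (for example one backed by a hardware energy counter, which is an assumption here).

```python
import time
from statistics import mean


def benchmark(operation, data, repeats=5, energy_reader=None):
    """Time operation(data) over several runs (hypothetical helper).

    energy_reader is an optional callable returning a cumulative energy
    counter in joules; it is an assumption of this sketch, not a real API.
    """
    durations, energies = [], []
    for _ in range(repeats):
        start_energy = energy_reader() if energy_reader is not None else None
        start_time = time.perf_counter()
        operation(data)
        durations.append(time.perf_counter() - start_time)
        if energy_reader is not None:
            energies.append(energy_reader() - start_energy)
    return {
        "seconds": mean(durations),
        "joules": mean(energies) if energies else None,
    }


def chain(*operations):
    """Compose operations into one workflow, feeding each output to the next."""
    def workflow(data):
        for op in operations:
            data = op(data)
        return data
    return workflow


# Toy 2-dimensional table: each row is a dict, each key is a column.
rows = [{"id": i, "group": i % 3, "value": i * 0.5} for i in range(10_000)]


def filter_rows(table):
    # Keep only rows whose value exceeds 100.
    return [row for row in table if row["value"] > 100]


def sort_by_value(table):
    # Sort rows by the value column, largest first.
    return sorted(table, key=lambda row: row["value"], reverse=True)


# Micro-benchmark: each operation is measured in isolation.
micro = {op.__name__: benchmark(op, rows) for op in (filter_rows, sort_by_value)}

# Macro-benchmark: the chained workflow is measured as a single unit.
macro = benchmark(chain(filter_rows, sort_by_value), rows)
```

Passing the operations as plain callables is what would let the same harness wrap different frameworks; in a real comparison each callable would invoke the equivalent operation in pandas, Modin, Polars, or RAPIDS.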
Type:	Master's dissertation
Description:	Master's dissertation in Informatics Engineering
URI:	https://hdl.handle.net/1822/92591
Access:	Open access
Appears in collections:	BUM - Dissertações de Mestrado; DI - Dissertações de Mestrado