Cost optimization of data flows based on task re-ordering

Authors: Georgia Kougka and Anastasios Gounaris

Volume 33 (2017)


Analyzing big data with the help of automated data flows attracts a lot of attention because of the growing need for end-to-end processing of this data. Modern data flows may consist of a high number of tasks and it is difficult for flow designers to define an efficient execution order of the tasks manually given that, typically, there is significant freedom in the valid positioning for some of the tasks. Several automated execution plan enumeration techniques have been proposed. These solutions can be broadly classified into three categories, each having significant limitations: (i) the optimizations are based on rewrite rules similar to those used in databases, such as filter and projection push-down, but these rules cover only the flow tasks that correspond to extended relational algebra operators. To cover arbitrary tasks, the solutions (ii) either rely on simple heuristics, or (iii) they exhaustively check all orderings, and thus cannot scale. We target the second category and we propose an efficient and polynomial cost-based task ordering solution for flows with arbitrary tasks seen as black boxes. We evaluated our proposals using both real runs and simulations, and the results show that we can achieve speed-ups of orders of magnitude, especially for flows with a high number of tasks even for relatively low flexibility in task positioning.