Data Processing Using Functions in Prosto: An Alternative to Map-Reduce and SQL
We normally want to add a new column to an existing table containing groups. Prosto provides a new solution: aggregation is a column operation rather than a set operation.
Join the DZone community and get the full member experience.Join For Free
Why Prosto? Having Only Set Operations Is Not Enough
Typical data processing tasks have to access and analyze data stored in multiple tables. They could be called relations, collections, or sets in different systems but we will refer to them as tables for simplicity. The general task of data processing is to derive new data from these multiple tables and each solid data (processing) model must answer the following three important questions: how to compute new columns within one table, how to link tables and how to aggregate data. Below we shortly describe how these tasks are solved in a traditional set-oriented model and where these solutions have significant flaws.
Calculation. Given a table, we frequently need to add a new column with values computed from other columns in this same table. Conceptually, the task is similar to defining a cell in a spreadsheet, for example,
C1=A1+B1. Easy and natural? Yes. However, it is not so easy in traditional data processing frameworks. The main problem is that we need to define a new table because adding a column to an existing table is not possible. Prosto toolkit is intended to fix this flaw by providing a dedicated operation where a new column can be added as in this example:
Linking. Data in different tables exist in isolated form and the system is unaware of possible relationships between its elements. Therefore, we cannot rely on the system when we need to process data residing in these tables. For example, if we have state abbreviation (like AL) in our table but for the analysis, we need state names (like Alaska) that reside in another table then the system has to be taught how these two tables are related. After that, it can access names given abbreviations. The standard way to do this is to join two tables by providing two specifications: 1) criteria of joining (a condition) and 2) creating a new table that uses these criteria to collect all the necessary data into one output table. This approach has some significant drawbacks: 1) we actually did not say that we want to create a new table — in many cases, it is not needed at all (neither conceptually nor physically) 2) one join criterion can be (and in most cases will be) reused in many queries but we cannot factor it out because it is how this operation is defined (the problem of separation of concerns). Prosto solves this problem by providing means for defining link conditions as derived columns with no unnecessary tables and allowing for their reuse in many other queries where the linked data has to be accessed.
Aggregation. Analytical workloads are based on aggregating data stored in many different records. The standard solution is based on applying Groupby operation which needs two constituents: 1) criteria for grouping records into subsets, 2) operation for computing one value from a subset of values in a group. Groupby returns a new table although we actually do not need it. Rather, we normally want to add a new column to the existing table containing groups. Prosto provides a new solution where the aggregation is a column operation (rather than a set operation).
What is Prosto: An Alternative to Map-Reduce and SQL
Prosto  is a Python data processing toolkit to programmatically author and execute complex data processing workflows. Conceptually, it is an alternative to purely set-oriented approaches to data processing like map-reduce, relational algebra, SQL, or data-frame-based tools like Pandas.
Prosto radically changes the way data is processed by relying on a novel data processing paradigm: a concept-oriented model of data [2, 3]. It treats columns (modeled via mathematical functions) as first-class elements of the data processing pipeline having the same rights as tables. If a traditional data processing graph consists of only set operations then the Prosto workflow consists of two types of operations:
- Table operations produce (populate) new tables from existing tables. A table is an implementation of a mathematical set which is a collection of tuples.
- Column operations produce (evaluate) new columns from existing columns. A column is an implementation of a mathematical function that maps tuples from one set to another set.
The concept-oriented model was also implemented in the Lambdo  toolkit which however has a somewhat different focus: its goal is to unify feature engineering and machine learning.
In the next sections, we will describe how Prosto and its column operations can be applied to COVID-19 statistical data. The example used in this article is implemented in a Jupyter notebook which can be checked out from the Prosto git repository: prosto/covid.ipynb at master · asavinov/prosto (github.com).
Analyzing COVID-19 Statistics
The data is stored in two CSV files:
states.csv. Fig. 1 shows the columns of these tables which will be used for analysis. Solid lines represent source fields that are stored in the file and dashed lines represent derived columns which we will be computed using Prosto operations. Importantly, no new tables will be produced — only new columns will be added to existing tables.
Our goal is to compute three new columns as depicted in Fig. 1. First, we want to compute the death to recovered ratio as a calculated column of the
all-states-history table. Second, we want to compute a link column the values of which will point to the records of the
states table. Third, we want to compute the mean death to the recovered ratio for each state in the
states table by aggregating data in the
Prosto can be installed like most other Python packages:
Alternatively, the project source code can be checked out from git and then the checked-out version installed from the project root:
Analysis of data in an application or script starts from importing the package and instantiating a workflow object:
This object is then used to add operations and finally execute the whole workflow. Prosto provides many different operations including table operations (which generate new tables) but we will consider only three column operations: populate, calculate, link and aggregate.
Before data in a workflow can be processed, the source tables have to be populated. We assume that the source data is loaded from CSV files into pandas data frames. It is also necessary to provide attribute names which in our case are column names of the CSV file. The source data is added to the workflow via the populate operation:
In the same way, we populate the second table with state names:
In the next sections, we will show how to add calculate, link and aggregate operations.
Calculate Column Instead of A New Table
The first operation is intended for computing new column values at the level of one row. For each row of a table, the new column value is computed from the values of some other columns of the same row. In our example, our goal is to compute the death to recovered ratio which is done as follows:
Prosto will only add this operation definition by taking into account dependencies but without execution.
Link Column Instead of Join
Here the goal is to make data in the
states table available from the
all-states-history and vice versa. A typical solution consists in creating a new table with data from both tables by applying the join operation. Prosto does not use joins: either at the conceptual or the logical level. (Yet, joins are actually used at the physical level because Pandas provides a very efficient implementation for matching records in frames). Prosto provides a link operation for defining a new link column. It will be treated as a normal (derived) column which can be used like any other column in other operations like calculate or aggregate.
A new link column in our example is defined as follows:
This column is a conceptual replacement of the join operation. More specifically, it replaces the join condition. In the next section, we will show how this link column is used in an aggregate operation.
Aggregate Column Instead of Group-By
Our goal is to add a new column to the
states table so that each state record stores the mean death to recovered ratio for this state. The mean value is computed for all records from the
all-states-history belonging to this state. In Prosto, this operation is defined as follows:
Note that this column added to the
states table but it is computed from data in the
all-states-history table. The grouping is performed by means of the previously defined link column. In contrast to the traditional set-oriented approach based on Groupby, we do not create a new table by directly computing what is needed.
Executing a Workflow and Getting the Results
A workflow object stores only definitions. All operations in the workflow are evaluating according to their dependencies by calling the following functions:
Now we can retrieve the results:
Which will contain the following table:
Note that it is the same original
states table but now it gets one additional column with the results of aggregation over the records in the
Prosto is a data processing toolkit that is based on using column operations for data transformations. It is based on the concept-oriented data model where mathematical functions are used along with sets for data representation and transformation. This approach has the following benefits:
- New tables are created only if really necessary while in many cases columns are defined and computed directly without unnecessary tables.
- In many cases, applying column operations is simpler and more natural in comparison to set operations.
- Column operations use data stored in source tables directly without unnecessary copy operations.
- Link columns are used instead of joins which are known to be difficult to understand.
- Aggregate columns are used instead of Groupby by allowing for direct definitions of how data has to be aggregated.
- Model object as a parameter of most operations which allows for defining features based on ML models.
 https://github.com/asavinov/prosto Prosto data processing toolkit: an alternative to map-reduce and join-groupby.
: A.Savinov. On the importance of functions in data modeling, Eprint: arXiv:2012.15570 [cs.DB], 2019.
: A.Savinov. Concept-oriented model: Modeling and processing data using functions, Eprint: arXiv:1606.02237 [cs.DB], 2019.
 https://github.com/asavinov/lambdo Feature engineering and machine learning: together at last!
Opinions expressed by DZone contributors are their own.