Exploring the S&P 100 Index Stocks Using Graph Machine Learning
Explore how to use the Java-based graph analysis library JGraphT and the diagramming library mxGraph to visualize the changes in correlation between the S&P 100 Index stocks over time.
In this post, we explore how to use the Java-based graph analysis library JGraphT and the diagramming library mxGraph to visualize the changes in correlation between the S&P 100 Index stocks over time.
The main analysis method used in this post follows Bonanno et al. (2000) and Mantegna (1999). There are two datasets: the vertex set and the edge set.
Stock Data (Vertex Set)
Select the following S&P 100 Index stocks, and model each stock as a vertex. The properties of each vertex are the stock code and the industry of the listed company.
Table 1: The vertex set sample
| Vertex ID | Code | Industry | Company Name |
|---|---|---|---|
| 1 | AAPL | Consumer Electronics | Apple Inc. (AAPL) |
| 2 | ABBV | Drug Manufacturers—General | AbbVie Inc. (ABBV) |
| 3 | ABT | Medical Devices | Abbott Laboratories (ABT) |
| 4 | ACN | Information Technology Services | Accenture plc (ACN) |
| 5 | AIG | Insurance—Diversified | American International Group, Inc. (AIG) |
| 6 | ALL | Insurance—Property & Casualty | The Allstate Corporation (ALL) |
| 7 | AMGN | Drug Manufacturers—General | Amgen Inc. (AMGN) |
| 8 | AMT | REIT—Specialty | American Tower Corporation (REIT) (AMT) |
| 9 | AMZN | Internet Retail | Amazon.com, Inc. (AMZN) |
Stock Relationship (Edge/Relationship Set)
Each edge has only one property: the weight. The weight of an edge indicates how similar the daily return rates are for the two listed companies (stocks) at its ends. The similarity algorithm follows Bonanno et al. (2000) and Mantegna (1999): analyze the time-series correlation of the individual stocks' daily return rates over a period of time (from 2014-01-01 to 2020-01-01), and define the distance between two stocks (i.e. the edge weight) as follows.
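Assuming the post follows the distance metric of Mantegna (1999), which is consistent with the [0,2] range stated next, the distance between stocks $i$ and $j$ can be written as below. Here $r_i$ is the daily return series of stock $i$ and $\langle \cdot \rangle$ is the average over the chosen period; both symbols are notation introduced for this sketch.

```latex
d_{ij} = \sqrt{2\,(1 - \rho_{ij})},
\qquad
\rho_{ij} = \frac{\langle r_i r_j \rangle - \langle r_i \rangle \langle r_j \rangle}
                 {\sqrt{\left(\langle r_i^2 \rangle - \langle r_i \rangle^2\right)
                        \left(\langle r_j^2 \rangle - \langle r_j \rangle^2\right)}}
```

With $\rho_{ij} = 1$ (perfectly correlated) the distance is 0, and with $\rho_{ij} = -1$ (perfectly anti-correlated) it is 2.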
Through such processing, we get the range of the distance as [0,2]. That means the more distant the two stocks, the lower the correlation between their return rates.
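As a minimal sketch of how such an edge weight could be computed, the class below derives daily returns from closing prices, takes their Pearson correlation, and maps it to a distance in [0,2]. It assumes the metric d = sqrt(2(1 - rho)) from Mantegna (1999) and simple (not logarithmic) returns; the class name `StockDistance`, its method names, and the sample prices are hypothetical and not taken from the post's code.

```java
import java.util.Arrays;

/** Sketch: turn two price series into a correlation-based distance. */
final class StockDistance {

    /** Simple daily returns (an assumption): r[t] = p[t+1] / p[t] - 1. */
    static double[] dailyReturns(double[] prices) {
        double[] r = new double[prices.length - 1];
        for (int t = 0; t < r.length; t++) {
            r[t] = prices[t + 1] / prices[t] - 1.0;
        }
        return r;
    }

    /** Pearson correlation of two equally long series (NaN if one is constant). */
    static double correlation(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("series must have equal length >= 2");
        }
        double meanX = Arrays.stream(x).average().orElse(0.0);
        double meanY = Arrays.stream(y).average().orElse(0.0);
        double cov = 0.0, varX = 0.0, varY = 0.0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        return cov / Math.sqrt(varX * varY);
    }

    /** Assumed distance d = sqrt(2 * (1 - rho)); clamped against rounding below zero. */
    static double distance(double[] x, double[] y) {
        double rho = correlation(x, y);
        return Math.sqrt(Math.max(0.0, 2.0 * (1.0 - rho)));
    }

    public static void main(String[] args) {
        // Made-up closing prices, for illustration only.
        double[] a = dailyReturns(new double[] {100, 102, 101, 105});
        double[] b = dailyReturns(new double[] {50, 49, 51, 50});
        System.out.println("distance = " + distance(a, b));
    }
}
```

Computing this value for every pair of stocks gives one weighted edge per pair, so the resulting graph is complete.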
Table 2: The edge set sample
| Source vertex ID | Target vertex ID | Edge weight |
|---|---|---|
Such a vertex set and an edge set form a graph network, which can be stored in the Nebula Graph database.
JGraphT is an open-source Java class library that provides not only a variety of efficient and common graph data structures but also many useful algorithms for solving the most common graph problems. The features of JGraphT are as follows:
- Supports directed edges, undirected edges, weighted edges, non-weighted edges, etc.
- Supports simple graphs, multigraphs, and pseudographs.
- Provides dedicated iterators (DFS, BFS, etc.) for graph traversals.
- Provides a large number of commonly used graph algorithms, such as path finding, isomorphism detection, coloring, common ancestors, tours, connectivity, matching, cycle detection, partitions, cuts, flows, and centrality.
- Provides easy import from/export to GraphViz. The exported GraphViz file can then be loaded into the visualization tool Gephi for analysis and demonstration.
- Makes it convenient to generate graph networks when used with tools such as JGraphX, mxGraph, and Guava Graphs Generators.
Next, let’s try it out.
- Create a directed graph in JGraphT.
- Add vertices.
- Add edges.
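The three steps above can be sketched as follows. This is a minimal example that assumes JGraphT 1.x (`jgrapht-core`) is on the classpath; the class name `DirectedGraphDemo` is hypothetical, and the edges between the stock codes are placeholders rather than real relationships.

```java
import org.jgrapht.Graph;
import org.jgrapht.graph.DefaultDirectedGraph;
import org.jgrapht.graph.DefaultEdge;

/** Sketch: create a small directed graph with JGraphT. */
final class DirectedGraphDemo {

    static Graph<String, DefaultEdge> build() {
        // 1. Create a directed graph whose vertices are stock codes.
        Graph<String, DefaultEdge> graph = new DefaultDirectedGraph<>(DefaultEdge.class);

        // 2. Add vertices.
        graph.addVertex("AAPL");
        graph.addVertex("ABBV");
        graph.addVertex("ABT");

        // 3. Add edges (placeholder relationships, for illustration only).
        graph.addEdge("AAPL", "ABBV");
        graph.addEdge("ABBV", "ABT");
        return graph;
    }

    public static void main(String[] args) {
        Graph<String, DefaultEdge> graph = build();
        System.out.println("vertices: " + graph.vertexSet());
        System.out.println("edges: " + graph.edgeSet());
    }
}
```

Because the graph is directed, `containsEdge("AAPL", "ABBV")` is true while the reverse direction is not.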
The Nebula Graph Database
JGraphT usually uses local files as data sources. That's fine for static network research, but if the graph network changes constantly, as a stock network does every day, generating a new static file, loading it, and analyzing it each time is a hassle. Ideally, the entire change process is written persistently to a database, and subgraphs or the complete graph are loaded directly from the database in real time for analysis. In this post, we use Nebula Graph as the graph database for storing the graph data.
Nebula Graph's Java client, nebula-java, provides two ways of accessing Nebula Graph. One is to interact with the query engine layer through the graph query language nGQL, which is usually suitable for accessing subgraphs and supports complex semantics; the other is to interact directly with the underlying storage layer (the storaged process) through the APIs, which is used to obtain the complete set of vertices and edges. In addition to accessing Nebula Graph itself, nebula-java provides examples of interacting with Neo4j, JanusGraph, Spark, and others.
In this post, we use the APIs to access the storage layer (the storaged process) directly to get all the vertices and edges. The following two interfaces can be used to read all the vertex and edge data.
- Initialize a client and a ScanVertexProcessor. ScanVertexProcessor is used to decode the read vertex data:
- Call the scanVertex interface, which returns an iterator for the scanVertexResponse object:
- Keep reading the data in the scanVertexResponse object that the iterator points to until all of the data is read. The read vertex data is saved and later added to the graph structure of JGraphT.
Reading edge data is similar to the above process.
Analyze the Graph in JGraphT
- Create an undirected and weighted graph in JGraphT:
- Add the vertex and edge data read in the last step from the Nebula Graph space to the graph:
- Following the analysis in Bonanno et al. (2000) and Mantegna (1999), run Prim's minimum spanning tree algorithm on the preceding graph and call the encapsulated drawGraph interface to draw the result.
Prim's algorithm finds a minimum spanning tree in a weighted connected graph. In other words, the tree formed by the edge subset that the algorithm selects includes all vertices of the connected graph and has the minimum possible sum of edge weights.
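To illustrate the algorithm itself, here is a standalone sketch of Prim's method on a symmetric weight matrix, which suits a complete graph such as the pairwise stock distances. It is not the code used in the post: JGraphT ships its own implementation (`org.jgrapht.alg.spanning.PrimMinimumSpanningTree`), and the class name `PrimSketch`, its methods, and the sample weights here are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch: Prim's algorithm on a dense, symmetric weight matrix, O(n^2). */
final class PrimSketch {

    /** Returns the tree edges as {from, to} index pairs, growing the tree from vertex 0. */
    static List<int[]> minimumSpanningTree(double[][] w) {
        int n = w.length;
        boolean[] inTree = new boolean[n];
        double[] best = new double[n]; // cheapest known edge from the tree to each vertex
        int[] parent = new int[n];     // tree endpoint of that cheapest edge
        Arrays.fill(best, Double.POSITIVE_INFINITY);
        Arrays.fill(parent, -1);
        best[0] = 0.0;

        List<int[]> edges = new ArrayList<>();
        for (int step = 0; step < n; step++) {
            // Pick the vertex outside the tree that is cheapest to attach.
            int u = -1;
            for (int v = 0; v < n; v++) {
                if (!inTree[v] && (u == -1 || best[v] < best[u])) {
                    u = v;
                }
            }
            inTree[u] = true;
            if (parent[u] >= 0) {
                edges.add(new int[] {parent[u], u});
            }
            // Relax the attachment cost of the remaining vertices through u.
            for (int v = 0; v < n; v++) {
                if (!inTree[v] && w[u][v] < best[v]) {
                    best[v] = w[u][v];
                    parent[v] = u;
                }
            }
        }
        return edges;
    }

    /** Sum of the weights of the given edges. */
    static double totalWeight(double[][] w, List<int[]> edges) {
        double sum = 0.0;
        for (int[] e : edges) {
            sum += w[e[0]][e[1]];
        }
        return sum;
    }

    public static void main(String[] args) {
        // Made-up distances between four vertices, for illustration only.
        double[][] w = {
            {0.0, 1.0, 4.0, 3.0},
            {1.0, 0.0, 2.0, 5.0},
            {4.0, 2.0, 0.0, 1.5},
            {3.0, 5.0, 1.5, 0.0}
        };
        List<int[]> mst = minimumSpanningTree(w);
        for (int[] e : mst) {
            System.out.println(e[0] + " - " + e[1] + " (weight " + w[e[0]][e[1]] + ")");
        }
        System.out.println("total weight = " + totalWeight(w, mst));
    }
}
```

For n vertices the tree keeps only n - 1 of the n(n - 1)/2 pairwise edges, which is what makes the drawn graph readable.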
- The drawGraph method encapsulates parameter settings such as the layout of the drawing. This method renders stocks in the same sector in the same color, grouping close stocks together.
- Visualize the data.
The color of each vertex in Figure 1 represents its industry. Stocks with high business similarity are clustered together, but so are some stocks with no obvious correlation; the reason for this needs to be studied separately.
Figure 1: Clustering based on the stock data from 2014-01-01 to 2020-01-01
- Some other dynamic exploration based on different time windows.
The preceding conclusion is based mainly on the stock aggregation from 2014-01-01 to 2020-01-01. We also tried a 2-year sliding window with the same analysis method to observe whether the clustered groups change over time.
Figure 2: Clustering based on the stock data from 2014-01-01 to 2016-01-01
Figure 3: Clustering based on the stock data from 2015-01-01 to 2017-01-01
Figure 4: Clustering based on the stock data from 2016-01-01 to 2018-01-01
Figure 5: Clustering based on the stock data from 2017-01-01 to 2019-01-01
Figure 6: Clustering based on the stock data from 2018-01-01 to 2020-01-01
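The five windows behind Figures 2 to 6 can be generated programmatically. The sketch below assumes a 2-year window advanced one year at a time, as the figure captions suggest; the class name `SlidingWindows` and its method are hypothetical and not part of the post's code.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

/** Sketch: enumerate sliding date windows over an analysis period. */
final class SlidingWindows {

    /** Returns {start, end} pairs of length windowYears, advanced by stepYears, that fit within [from, to]. */
    static List<LocalDate[]> windows(LocalDate from, LocalDate to, int windowYears, int stepYears) {
        if (windowYears <= 0 || stepYears <= 0) {
            throw new IllegalArgumentException("window and step must be positive");
        }
        List<LocalDate[]> result = new ArrayList<>();
        for (LocalDate start = from;
             !start.plusYears(windowYears).isAfter(to);
             start = start.plusYears(stepYears)) {
            result.add(new LocalDate[] {start, start.plusYears(windowYears)});
        }
        return result;
    }

    public static void main(String[] args) {
        List<LocalDate[]> all =
            windows(LocalDate.of(2014, 1, 1), LocalDate.of(2020, 1, 1), 2, 1);
        for (LocalDate[] w : all) {
            // Each window would be used to recompute the correlations and rebuild the tree.
            System.out.println(w[0] + " to " + w[1]);
        }
    }
}
```

Each window would then feed the same pipeline: recompute the pairwise distances on that date range, rebuild the graph, and draw its minimum spanning tree.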
This post should not be taken as investment advice. Because of trade suspensions, circuit breakers, trading limits, transfers, mergers and acquisitions, changes of main business, and so on, the data processing in this post may contain errors. We have not checked all the data piece by piece.
Limited by time, this post uses only 100 stock samples over the past six years and only the minimum spanning tree method for clustering and classification. In the future, we may use larger data sets (such as U.S. stocks, derivatives, and digital currencies) and try more machine learning methods.
For the code used in this post, please see .
Analyzing Relationships in Game of Thrones With NetworkX, Gephi, and Nebula Graph (Part One) https://nebula-graph.io/posts/game-of-thrones-relationship-networkx-gephi-nebula-graph/
Analyzing Relationships in Game of Thrones With NetworkX, Gephi, and Nebula Graph (Part Two) https://nebula-graph.io/posts/game-of-thrones-relationship-networkx-gephi-nebula-graph-part-two/
NetworkX: a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks. https://networkx.github.io/
Nebula Graph: A powerfully distributed, scalable, lightning-fast graph database written in C++. https://nebula-graph.io/
JGraphT: a Java library of graph theory data structures and algorithms. https://jgrapht.org/
Bonanno, Giovanni & Lillo, Fabrizio & Mantegna, Rosario. (2000). High-frequency Cross-correlation in a Set of Stocks. arXiv.org, Quantitative Finance Papers. 1. 10.1080/713665554.
Mantegna, R.N. Hierarchical structure in financial markets. Eur. Phys. J. B 11, 193–197 (1999).
Nebula Graph Query Language (nGQL). https://docs.nebula-graph.io/manual-EN/1.overview/1.concepts/2.nGQL-overview/
Nebula Graph Query Engine. https://github.com/vesoft-inc/nebula-graph
Nebula-storage: A distributed consistent graph storage. https://github.com/vesoft-inc/nebula-storage
Neo4j. www.neo4j.com
JanusGraph. janusgraph.org
Apache Spark. spark.apache.org
Published at DZone with permission of Jamie Liu. See the original article here.