Utilising Neo4j for Analysis of a Twitch Dataset
Introduction
Neo4j is a graph database management system designed for storing and retrieving highly interconnected data. It uses the property graph model, in which both nodes and relationships carry properties, to represent and analyse complex relationships between data elements. Because relationships are stored directly rather than reconstructed through joins, as in a traditional relational database, connected data is easier to query and patterns are easier to analyse. Neo4j also offers Cypher, a query language that makes it easier to draw conclusions from graph data, and a variety of plugins for graph visualisation and data science operations.
This blog’s main goal is to give an overview of the results obtained by running various queries on a dataset of Twitch streamers using Neo4j. Readers interested mainly in clustering can skip ahead to the “Clustering using GDS” section.
Set-Up
The dataset used in this investigation has about 6.8 million edges and roughly 168,000 nodes. Loading it with Cypher’s LOAD CSV clause turned out to be very slow. Experimentation showed, however, that Neo4j’s neo4j-admin import terminal command could set up the network in under a minute.
The neo4j-admin import command can load the network of 168,114 nodes and 6,797,557 edges, keeping all node properties, in about 20 seconds. Loading just the edges with LOAD CSV, by contrast, would take more than 40 minutes.
Note that the nodes_header.csv and edges_header.csv files passed to the command contain data-type information as well as the column names. Because entries are treated as strings by default, this is crucial for importing the data correctly.
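For comparison, the slower LOAD CSV route needs the same type handling done by hand. The sketch below is only an illustration of that point, not the import used in this blog. The file names (nodes.csv, edges.csv), the Streamer label, the FOLLOWS relationship type, and the source and target column names are all assumptions; only numeric_id and views appear in the queries later in this post.

```cypher
// Assumed file names, label, relationship type and edge columns; adjust to the real CSV files.
// LOAD CSV reads every field as a string, so numeric columns are converted explicitly.
LOAD CSV WITH HEADERS FROM 'file:///nodes.csv' AS row
CREATE (:Streamer {
  numeric_id: toInteger(row.numeric_id),
  views: toInteger(row.views)
});

// An index on the lookup key keeps the edge load from scanning all nodes for every row.
CREATE INDEX streamer_numeric_id IF NOT EXISTS
FOR (s:Streamer) ON (s.numeric_id);

// Each edge row needs two node lookups, which is a large part of why this route is slow.
LOAD CSV WITH HEADERS FROM 'file:///edges.csv' AS row
MATCH (s:Streamer {numeric_id: toInteger(row.source)})
MATCH (t:Streamer {numeric_id: toInteger(row.target)})
CREATE (s)-[:FOLLOWS]->(t);
```

Without the toInteger calls, views would be stored as text, and sorting by it would compare strings rather than numbers.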
Cypher Queries
Cypher, Neo4j’s query language, is commonly used in data science applications because it can handle intricate data relationships and extract useful results from large datasets. It is designed to be flexible and readable, so data scientists can write complex queries that others can still follow.
You can run the command “MATCH (n) RETURN n” to list every node in the network. Neo4j Browser caps the number of nodes it draws for a single query, although this cap can be changed in the settings. With a network of this size, only part of the graph is therefore drawn on screen, and some nodes will not be visible.
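Because of this display cap, it is often more practical to count the nodes and to draw only a sample. A minimal sketch follows; the limit of 25 is an arbitrary illustrative value.

```cypher
// Count every node without asking the Browser to draw them.
MATCH (n)
RETURN count(n) AS nodes;

// Draw only a small sample; 25 is an arbitrary illustrative limit.
MATCH (n)
RETURN n
LIMIT 25;
```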
The following Cypher command returns the top 10 nodes in this dataset ranked by their number of outgoing connections:
“MATCH (s)-[]->(t) RETURN s.numeric_id, count(t) AS connections ORDER BY connections DESC LIMIT 10”
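The query above follows relationships in one direction only. If direction does not matter for the analysis, an undirected pattern counts connections either way. This is a sketch under that assumption; a pair of nodes linked in both directions would be counted twice.

```cypher
// Undirected pattern: counts relationships regardless of direction.
// A pair linked in both directions contributes two to each node's count.
MATCH (s)--(t)
RETURN s.numeric_id AS streamer, count(t) AS connections
ORDER BY connections DESC
LIMIT 10;
```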
To rank the nodes by number of views instead, the Cypher command would be:
“MATCH (n) RETURN n.numeric_id, n.views AS views ORDER BY n.views DESC LIMIT 10”
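The two rankings can also be compared side by side. The sketch below takes the top 10 nodes by views and attaches each node's outgoing connection count. The null filter is an added precaution, since Cypher sorts null values first in descending order.

```cypher
// Pick the 10 most-viewed nodes first, then count their outgoing connections.
MATCH (n)
WHERE n.views IS NOT NULL          // nulls would otherwise sort first with DESC
WITH n
ORDER BY n.views DESC
LIMIT 10
OPTIONAL MATCH (n)-->(t)           // OPTIONAL keeps nodes that have no outgoing edges
RETURN n.numeric_id AS streamer, n.views AS views, count(t) AS connections
ORDER BY views DESC;
```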
Clustering using GDS
Clustering groups related nodes according to predetermined criteria. Graph Data Science (GDS), a plugin offered by Neo4j, provides a number of clustering algorithms under the heading “Community Detection.” The Louvain community detection algorithm from GDS was applied to this dataset and produced 19 distinct clusters.
GDS algorithms run on a named in-memory projection of the graph. To create that projection, the following command can be used:
“CALL gds.graph.project.cypher('twitch', 'MATCH (n) RETURN id(n) AS id, n.views AS views', 'MATCH (n)-[]->(m) RETURN id(n) AS source, id(m) AS target') YIELD graphName, nodeCount AS nodes, relationshipCount AS rels RETURN graphName, nodes, rels”
This command creates an in-memory graph named “twitch” that includes every node, with views as a node property, and every relationship. The projection exists only while the database is running and does not change the stored data.
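Since the projection lives in memory, it helps to check that it exists and to release it once the analysis is done. The sketch below uses the GDS graph catalog procedures; the drop should only be run after all clustering steps are finished.

```cypher
// List the projections currently held in the GDS graph catalog.
CALL gds.graph.list()
YIELD graphName, nodeCount, relationshipCount
RETURN graphName, nodeCount, relationshipCount;

// Run only when the analysis is finished: frees the memory held by 'twitch'.
// The second argument (false) avoids an error if the graph does not exist.
CALL gds.graph.drop('twitch', false)
YIELD graphName
RETURN graphName;
```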
The Louvain clustering method can be invoked using:
“CALL gds.louvain.write('twitch', {writeProperty: 'louvain'})”
Running this command applies the Louvain algorithm and writes the result back to each node as a property named “louvain”, which holds the ID of the node’s community. To display clusters separately, the property can be turned into a node label using the following command, which relies on the APOC plugin and returns the number of relabelled nodes:
“MATCH (n) CALL apoc.create.addLabels([id(n)], [toString(n.louvain)]) YIELD node WITH node REMOVE node.louvain RETURN count(node) AS relabelled”
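The result can be checked in two ways, sketched below. The first query assumes the “twitch” projection still exists. The second assumes the label conversion above has been run; if the nodes also carry labels from the import, those will show up in the list as well.

```cypher
// Stats mode runs Louvain on the 'twitch' projection without writing anything back.
CALL gds.louvain.stats('twitch')
YIELD communityCount, modularity
RETURN communityCount, modularity;

// After the label conversion, count the members of each cluster label.
MATCH (n)
UNWIND labels(n) AS cluster
RETURN cluster, count(*) AS members
ORDER BY members DESC;
```

The community count from a fresh run may differ slightly from the 19 clusters reported here, since Louvain results can vary between runs. Labels made from numeric IDs must be quoted with backticks when used in a pattern. For a hypothetical community ID of 7, this would be MATCH (n:`7`) RETURN n LIMIT 100.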
Conclusion
Neo4j is a popular graph database management system for storing and retrieving complex, interrelated data. It uses the property graph model and the Cypher query language to analyse and visualise complex relationships between data elements. This blog presented the results obtained by running various queries, including clustering, on a dataset of Twitch streamers using Neo4j. The dataset, with 168,114 nodes and 6,797,557 edges, was loaded with Neo4j’s neo4j-admin import command, which preserved all node properties. Cypher queries were used to display the nodes and to retrieve the top 10 nodes ranked by number of connections and by number of views. Louvain clustering was carried out with Neo4j’s Graph Data Science (GDS) plugin and produced 19 distinct clusters. The blog also gave the commands for projecting the network as a graph, running Louvain clustering, and turning node properties into node labels.