
Data Migration From JanusGraph to Nebula Graph - Practice at 360 Finance

In this article, take a look at data migration from JanusGraph to Nebula Graph.

By Black Zhou · Sep. 10, 2020 · Tutorial

Speaking of graph data processing, we have had experience with various graph databases. In the beginning, we used the stand-alone edition of AgensGraph. Later, due to its performance limitations, we switched to JanusGraph, a distributed graph database; I described that migration in detail in my article "Migrate tens of billions of graph data into JanusGraph" (only in Chinese). As the data size and the number of business calls grew, a new problem appeared: each query took too long. In some business scenarios, a single query took up to 10 seconds, and as the data size increased, a moderately complicated query needed two or three seconds. These problems seriously affected the performance of the entire business process and the development of related businesses.

The architecture of JanusGraph makes slow single queries almost inevitable. The core reason is that JanusGraph depends on an external storage backend that it cannot control well. Our production environment uses an HBase cluster, which makes it impossible to push query filtering down to the storage layer; instead, data can only be pulled from HBase into the memory of the JanusGraph Server and filtered there.

For example, suppose we want to find the users older than 50 who have a one-hop relationship with a specified user: 1,000 users have such a relationship with that user, but only two of them are older than 50. When JanusGraph sends the query to HBase, the one-hop vertices cannot be filtered by their properties in storage. We therefore have to issue concurrent requests to HBase to fetch the properties of all 1,000 users, filter them in the memory of the JanusGraph Server, and finally return the two matching users to the client.
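To make the contrast concrete, here is a rough sketch of the two query styles; the tag, edge, and property names are made up for illustration, not taken from our real schema. In Gremlin on JanusGraph, the age predicate runs in JanusGraph Server memory after all 1,000 neighbors have been fetched from HBase, while the equivalent nGQL statement in Nebula Graph 1.0 evaluates the WHERE clause inside the storage layer:

// Gremlin on JanusGraph: has('age', gt(50)) is applied in JanusGraph Server
// memory, so the properties of all 1,000 one-hop neighbors are pulled from HBase first
g.V().has('person', 'userId', 'u_12345').out('friend').has('age', gt(50)).values('name')

# nGQL on Nebula Graph 1.0: the filter on the destination vertex is evaluated
# in nebula-storaged, so only the two matching users cross the network
GO FROM hash("u_12345") OVER friend WHERE $$.person.age > 50 YIELD friend._dst, $$.person.name;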

Such operations waste a lot of disk I/O and network I/O, and most of the data returned by the query is never used afterwards. The HBase cluster in our production environment consists of 19 high-configuration SSD servers. Its network I/O and disk I/O are shown below.

HBase network I/O

HBase disk I/O

In the same business scenario, Nebula Graph needs only 6 SSD servers with the same configuration to process the graph data. Its disk I/O and network I/O are shown below.

Nebula Graph network I/O

Nebula Graph disk I/O

From this comparison, it is clear that the performance of Nebula Graph is much better, and, notably, it is achieved with machine resources that are only 30% of those of the HBase cluster. Now let's look at the time consumption in the business scenarios: where JanusGraph needed two or three seconds per query, Nebula Graph takes only about 100 ms; where JanusGraph needed 10 to 20 seconds, Nebula Graph takes about two seconds. Overall, the average query time is about 500 ms, an improvement of at least 20 times.

Query time consumed

If you are still using JanusGraph, after reading this performance comparison, I guess you will forward this article to your team immediately, requesting a project to migrate graph data to Nebula Graph.

Migration of Historical Data

Now let's talk about how we migrated the data. Our data size is relatively large: about 2 billion vertices and 20 billion edges, which makes migration difficult. Luckily for us, Nebula Graph provides a Spark-based import tool, Spark Writer, which greatly simplified the process. Two lessons from our experience: first, asynchronous import may not be what you need, because it can introduce a lot of errors; we recommend switching the import method to synchronous writing. Second, if the amount of imported data is large, set the Spark partitions parameter to a large value; in our case, we set it to 80,000. If the value is smaller than it should be, each partition holds too much data, which easily causes OOM failures in your Spark tasks.
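For reference, here is a minimal sketch of the part of a Spark Writer configuration where the partition count is set. The structure follows the Nebula Graph 1.0 Spark Writer (HOCON) config as we recall it, and the tag, field, and path names are placeholders, not our real setup:

# One tag section of a Spark Writer config (names and paths are placeholders)
tags: [
  {
    name: person                         # target tag in the Nebula Graph space
    type: parquet                        # format of the source files
    path: "hdfs://cluster/graph/person"  # placeholder source path
    fields: {
      name: name,
      age: age
    }
    vertex: user_id                      # source column used to generate the vertex ID
    batch: 256                           # vertices per insert statement
    partition: 80000                     # a large value keeps each Spark partition small enough to avoid OOM
  }
]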

Query Tuning

We are now running Nebula Graph 1.0 in the production environment. In this environment, we use the hash() function instead of the uuid() function to generate vertex IDs, because uuid() consumes more time during data import and, reportedly, will no longer be supported in future versions of Nebula Graph.
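As a small illustration (the person tag and its properties are hypothetical), both functions generate the vertex ID from a string key at insert time; uuid() additionally persists a key-to-ID mapping, which is what makes it slower during import:

# hash() derives an int64 vertex ID from the string key deterministically
INSERT VERTEX person(name, age) VALUES hash("u_12345"):("Tom", 51);

# uuid() does the same but also stores the key-to-ID mapping, costing extra
# lookups during import; it is also said to be dropped in future versions
INSERT VERTEX person(name, age) VALUES uuid("u_12345"):("Tom", 51);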

In our production environment, the major tuning configurations are as follows. Most of them apply to nebula-storaged.

# The default reserved bytes for one batch operation
--rocksdb_batch_size=4096

# The default block cache size used in BlockBasedTable
# The unit is MB. The memory capacity of our production server is 128 GB
--rocksdb_block_cache=44024

############## rocksdb Options ##############
--rocksdb_disable_wal=true

# rocksdb DBOptions in json, each name and value of option is a string, given as "option_name":"option_value" separated by comma
--rocksdb_db_options={"max_subcompactions":"3","max_background_jobs":"3"}

# rocksdb ColumnFamilyOptions in json, each name and value of option is string, given as "option_name":"option_value" separated by comma
--rocksdb_column_family_options={"disable_auto_compactions":"false","write_buffer_size":"67108864","max_write_buffer_number":"4","max_bytes_for_level_base":"268435456"}

# rocksdb BlockBasedTableOptions in json, each name and value of option is string, given as "option_name":"option_value" separated by comma
--rocksdb_block_based_table_options={"block_size":"8192"}

--max_handlers_per_req=10
--heartbeat_interval_secs=10

# Newly added parameters
--raft_rpc_timeout_ms=5000
--raft_heartbeat_interval_secs=10
--wal_ttl=14400
--max_batch_size=512

# Parameters to reduce memory usage
--enable_partitioned_index_filter=true
--max_edge_returned_per_vertex=10000

Regarding Linux server tuning, the main step is disabling swap on the service machines, because swapping adds disk I/O that can degrade query performance. As for minor and major compaction tuning: in our production environment we enabled minor compaction but disabled major compaction, because a major compaction consumes a lot of disk I/O and is hard to throttle even by limiting the number of threads (--rocksdb_db_options={"max_subcompactions":"3","max_background_jobs":"3"}). It is said that this will be optimized in future versions of Nebula Graph.
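Concretely, disabling swap takes only the standard Linux commands, and automatic compaction is toggled through the rocksdb flag already shown in the configuration above:

# Turn swap off immediately; also comment out the swap entry in /etc/fstab
# so the change survives a reboot
sudo swapoff -a

# Automatic (minor) compactions stay enabled in our config; flipping this
# to "true" disables them, which can be useful during a bulk import
--rocksdb_column_family_options={"disable_auto_compactions":"false"}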

Finally, let's praise the max_edge_returned_per_vertex parameter of Nebula Graph. In my opinion, this one parameter alone earns Nebula Graph a top recommendation in the graph database industry. Our graph queries had long been troubled by super vertices that connect to millions of other vertices. In production, querying such vertices in JanusGraph on top of HBase can make crashes part of your routine work; it happened to us several times, and adding various LIMIT clauses to our Gremlin statements never really solved the problem. With Nebula Graph, the max_edge_returned_per_vertex parameter came to the rescue: the data is truncated directly in the underlying storage layer, which freed us from super-vertex trouble in production. For this parameter alone, we give Nebula Graph FIVE STARS!
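As a hedged sketch of the effect (the edge type below is made up): with the flag from our configuration, a one-hop traversal over a super vertex reads at most 10,000 of its edges, and the truncation happens inside nebula-storaged rather than after all the edges have been shipped to the query layer:

# nebula-storaged flag from our configuration: cap the edges fetched per vertex
--max_edge_returned_per_vertex=10000

# A one-hop query over a million-edge super vertex now touches at most
# 10,000 edges in storage ("transfer" is a hypothetical edge type)
GO FROM hash("super_user") OVER transfer YIELD transfer._dst;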


Published at DZone with permission of Black Zhou.
