DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Please enter at least three characters to search
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Zones

Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

Because the DevOps movement has redefined engineering responsibilities, SREs now have to become stewards of observability strategy.

Apache Cassandra combines the benefits of major NoSQL databases to support data management needs not covered by traditional RDBMS vendors.

The software you build is only as secure as the code that powers it. Learn how malicious code creeps into your software supply chain.

Generative AI has transformed nearly every industry. How can you leverage GenAI to improve your productivity and efficiency?

Related

  • Useful System Table Queries in Relational Databases
  • Fine-Tuning Performance, Resolving Common Issues in FinTech Application With MySQL
  • Optimizing Data Management: Migrating From AWS RDS MySQL to Snowflake
  • Common Mistakes to Avoid When Writing SQL Code

Trending

  • Implementing API Design First in .NET for Efficient Development, Testing, and CI/CD
  • A Guide to Auto-Tagging and Lineage Tracking With OpenMetadata
  • Agile’s Quarter-Century Crisis
  • Traditional Testing and RAGAS: A Hybrid Strategy for Evaluating AI Chatbots
  1. DZone
  2. Data Engineering
  3. Databases
  4. How to Avoid Hash Collisions When Using MySQL’s CRC32 Function

How to Avoid Hash Collisions When Using MySQL’s CRC32 Function

By 
Peter Zaitsev user avatar
Peter Zaitsev
·
Oct. 24, 14 · Interview
Likes (0)
Comment
Save
Tweet
Share
6.7K Views

Join the DZone community and get the full member experience.

Join For Free

Originally Written by Arunjith Aravindan

Percona Toolkit’s pt-table-checksum performs an online replication consistency check by executing checksum queries on the master, which produces different results on replicas that are inconsistent with the master – and the tool pt-table-sync synchronizes data efficiently between MySQL tables.

The tools by default use the CRC32. Other good choices include MD5 and SHA1. If you have installed the FNV_64 user-defined function, pt-table-sync will detect it and prefer to use it, because it is much faster than the built-ins. You can also use MURMUR_HASH if you’ve installed that user-defined function. Both of these are distributed with Maatkit. For details please see the tool’s documentation.

Below are test cases similar to what you might have encountered. By using the table checksum we can confirm that the two tables are identical and useful to verify a slave server is in sync with its master. The following test cases with pt-table-checksum and pt-table-sync will help you use the tools more accurately.

For example, in a master-slave setup we have a table with a primary key on column “a” and a unique key on column “b”. Here the master and slave tables are not in sync and the tables are having two identical values and two distinct values. The pt-table-checksum tool should be able to identify the difference between master and slave and the pt-table-sync in this case should sync the tables with two REPLACE queries.

+-----+-----+    +-----+-----+
|  a  |  b  |    |  a  |  b  |
+-----+-----+    +-----+-----+
|  2  |  1  |    |  2  |  1  |
|  1  |  2  |    |  1  |  2  |
|  4  |  3  |    |  3  |  3  |
|  3  |  4  |    |  4  |  4  |
+-----+-----+    +-----+-----+

Case 1:  Non-cryptographic Hash function (CRC32) and the Hash collision.

The tables in the source and target have two different columns and in general way of thinking the tools should identify the difference. But the below scenarios explain how the tools can be wrongly used and how to avoid them – and make things more consistent and reliable when using the tools in your production.

The tools by default use the CRC32 checksums and it is prone to hash collisions. In the below case the non-cryptographic function (CRC32) is not able to identify the two distinct values as the function generates the same value even we are having the distinct values in the tables.

CREATE TABLE `t1` (
  `a` int(11) NOT NULL,
  `b` int(11) NOT NULL,
  PRIMARY KEY (`a`),
  UNIQUE KEY `b` (`b`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Master           Slave
+-----+-----+    +-----+-----+
|  a  |  b  |    |  a  |  b  |
+-----+-----+    +-----+-----+
|  2  |  1  |    |  2  |  1  |
|  1  |  2  |    |  1  |  2  |
|  4  |  3  |    |  3  |  3  |
|  3  |  4  |    |  4  |  4  |
+-----+-----+    +-----+-----+
Master:
[root@localhost mysql]#  pt-table-checksum --replicate=percona.checksum --create-replicate-table --databases=db1 --tables=t1
localhost --user=root --password=*** --no-check-binlog-format
            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
09-17T00:59:45      0      0        4       1       0   1.081 db1.t1
Slave:
[root@localhost bin]# ./pt-table-sync --print --execute --replicate=percona.checksum --tables db1.t1 --user=root --password=***
 --verbose  --sync-to-master  192.**.**.**
# Syncing via replication h=192.**.**.**,p=...,u=root
# DELETE REPLACE INSERT UPDATE ALGORITHM START    END      EXIT DATABASE.TABLE

Narrowed down to BIT_XOR:

Master:
mysql> SELECT BIT_XOR(CAST(CRC32(CONCAT_WS('#', `a`, `b`)) AS UNSIGNED)) FROM `db1`.`t1`;
+------------------------------------------------------------+
| BIT_XOR(CAST(CRC32(CONCAT_WS('#', `a`, `b`)) AS UNSIGNED)) |
+------------------------------------------------------------+
|                                                    6581445 |
+------------------------------------------------------------+
1 row in set (0.00 sec)
Slave:
mysql> SELECT BIT_XOR(CAST(CRC32(CONCAT_WS('#', `a`, `b`)) AS UNSIGNED)) FROM `db1`.`t1`;
+------------------------------------------------------------+
| BIT_XOR(CAST(CRC32(CONCAT_WS('#', `a`, `b`)) AS UNSIGNED)) |
+------------------------------------------------------------+
|                                                    6581445 |
+------------------------------------------------------------+
1 row in set (0.16 sec)

Case 2: As the tools are not able to identify the difference, let us add a new row to the slave and check if the tools are able to identify the distinct values. So I am adding a new row (5,5) to the slave.

mysql> insert into db1.t1 values(5,5);
Query OK, 1 row affected (0.05 sec)
Master           Slave
+-----+-----+    +-----+-----+
|  a  |  b  |    |  a  |  b  |
+-----+-----+    +-----+-----+
|  2  |  1  |    |  2  |  1  |
|  1  |  2  |    |  1  |  2  |
|  4  |  3  |    |  3  |  3  |
|  3  |  4  |    |  4  |  4  |
+-----+-----+    |  5  |  5  |
                 +-----+-----+
 [root@localhost mysql]#  pt-table-checksum --replicate=percona.checksum --create-replicate-table --databases=db1 --tables=t1
 localhost --user=root --password=*** --no-check-binlog-format
            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
09-17T01:01:13      0      1        4       1       0   1.054 db1.t1
[root@localhost bin]# ./pt-table-sync --print --execute --replicate=percona.checksum --tables db1.t1 --user=root --password=***
 --verbose  --sync-to-master  192.**.**.**
# Syncing via replication h=192.**.**.**,p=...,u=root
# DELETE REPLACE INSERT UPDATE ALGORITHM START    END      EXIT DATABASE.TABLE
DELETE FROM `db1`.`t1` WHERE `a`='5' LIMIT 1 /*percona-toolkit src_db:db1 src_tbl:t1 src_dsn:P=3306,h=192.**.**.**.
10,p=...,u=root dst_db:db1 dst_tbl:t1 dst_dsn:h=192.**.**.**,p=...,u=root lock:1 transaction:1 changing_src:percona.checksum
replicate:percona.checksum bidirectional:0 pid:5205 user:root host:localhost.localdomain*/;
REPLACE INTO `db1`.`t1`(`a`, `b`) VALUES ('3', '4') /*percona-toolkit src_db:db1 src_tbl:t1 src_dsn:P=3306,h=192.**.**.**,
p=...,u=root dst_db:db1 dst_tbl:t1 dst_dsn:h=192.**.**.**,p=...,u=root lock:1 transaction:1 changing_src:percona.checksum
 replicate:percona.checksum bidirectional:0 pid:5205 user:root host:localhost.localdomain*/;
REPLACE INTO `db1`.`t1`(`a`, `b`) VALUES ('4', '3') /*percona-toolkit src_db:db1 src_tbl:t1 src_dsn:P=3306,h=192.**.**.**,
p=...,u=root dst_db:db1 dst_tbl:t1 dst_dsn:h=192.**.**.**,p=...,u=root lock:1 transaction:1 changing_src:percona.checksum
replicate:percona.checksum bidirectional:0 pid:5205 user:root host:localhost.localdomain*/;
#      1       2      0      0 Chunk     01:01:43 01:01:43 2    db1.t1

Well, apparently the tools are now able to identify the newly added row in the slave and the two other rows having the difference.

Case 3: Advantage of Cryptographic Hash functions (Ex: Secure MD5)

As such let us make the tables as in the case1 and ask the tools to use the cryptographic (secure MD5) hash functions instead the usual non-cryptographic function. The default CRC32 function provides no security due to their simple mathematical structure and too prone to hash collisions but the MD5 provides better level of integrity. So let us try with the –function=md5 and see the result.

Master           Slave
+-----+-----+    +-----+-----+
|  a  |  b  |    |  a  |  b  |
+-----+-----+    +-----+-----+
|  2  |  1  |    |  2  |  1  |
|  1  |  2  |    |  1  |  2  |
|  4  |  3  |    |  3  |  3  |
|  3  |  4  |    |  4  |  4  |
+-----+-----+    +-----+-----+

Narrowed down to BIT_XOR:

Master:
mysql> SELECT 'test', 't2', '1', NULL, NULL, NULL, COUNT(*) AS cnt, COALESCE(LOWER(CONCAT(LPAD(CONV(BIT_XOR(CAST(CONV(SUBSTRING
(@crc, 1, 16), 16, 10) AS UNSIGNED)), 10, 16), 16, '0'), LPAD(CONV(BIT_XOR(CAST(CONV(SUBSTRING(@crc := md5(CONCAT_WS('#', `a`, `b`))
, 17, 16), 16, 10) AS UNSIGNED)), 10, 16), 16, '0'))), 0) AS crc FROM `db1`.`t1`;
+------+----+---+------+------+------+-----+----------------------------------+
| test | t2 | 1 | NULL | NULL | NULL | cnt | crc                              |
+------+----+---+------+------+------+-----+----------------------------------+
| test | t2 | 1 | NULL | NULL | NULL |   4 | 000000000000000063f65b71e539df48 |
+------+----+---+------+------+------+-----+----------------------------------+
1 row in set (0.00 sec)
Slave:
mysql> SELECT 'test', 't2', '1', NULL, NULL, NULL, COUNT(*) AS cnt, COALESCE(LOWER(CONCAT(LPAD(CONV(BIT_XOR(CAST(CONV(SUBSTRING
(@crc, 1, 16), 16, 10) AS UNSIGNED)), 10, 16), 16, '0'), LPAD(CONV(BIT_XOR(CAST(CONV(SUBSTRING(@crc := md5(CONCAT_WS('#', `a`, `b`))
, 17, 16), 16, 10) AS UNSIGNED)), 10, 16), 16, '0'))), 0) AS crc FROM `db1`.`t1`;
+------+----+---+------+------+------+-----+----------------------------------+
| test | t2 | 1 | NULL | NULL | NULL | cnt | crc                              |
+------+----+---+------+------+------+-----+----------------------------------+
| test | t2 | 1 | NULL | NULL | NULL |   4 | 0000000000000000df024e1a4a32c31f |
+------+----+---+------+------+------+-----+----------------------------------+
1 row in set (0.00 sec)
[root@localhost mysql]# pt-table-checksum --replicate=percona.checksum --create-replicate-table --function=md5 --databases=db1
 --tables=t1 localhost --user=root --password=*** --no-check-binlog-format
            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
09-23T23:57:52      0      1       12       1       0   0.292 db1.t1
[root@localhost bin]# ./pt-table-sync --print --execute --replicate=percona.checksum --tables db1.t1 --user=root --password=amma
 --verbose --function=md5 --sync-to-master  192.***.***.***
# Syncing via replication h=192.168.56.102,p=...,u=root
# DELETE REPLACE INSERT UPDATE ALGORITHM START    END      EXIT DATABASE.TABLE
REPLACE INTO `db1`.`t1`(`a`, `b`) VALUES ('3', '4') /*percona-toolkit src_db:db1 src_tbl:t1 src_dsn:P=3306,h=192.168.56.101,p=...,
u=root dst_db:db1 dst_tbl:t1 dst_dsn:h=192.***.***.***,p=...,u=root lock:1 transaction:1 changing_src:percona.checksum
replicate:percona.checksum bidirectional:0 pid:5608 user:root host:localhost.localdomain*/;
REPLACE INTO `db1`.`t1`(`a`, `b`) VALUES ('4', '3') /*percona-toolkit src_db:db1 src_tbl:t1 src_dsn:P=3306,h=192.168.56.101,p=...,
u=root dst_db:db1 dst_tbl:t1 dst_dsn:h=192.***.**.***,p=...,u=root lock:1 transaction:1 changing_src:percona.checksum
replicate:percona.checksum bidirectional:0 pid:5608 user:root host:localhost.localdomain*/;
#      0       2      0      0 Chunk     04:46:04 04:46:04 2    db1.t1
Master           Slave
+-----+-----+    +-----+-----+
|  a  |  b  |    |  a  |  b  |
+-----+-----+    +-----+-----+
|  2  |  1  |    |  2  |  1  |
|  1  |  2  |    |  1  |  2  |
|  4  |  3  |    |  4  |  3  |
|  3  |  4  |    |  3  |  4  |
+-----+-----+    +-----+-----+

The MD5 did the trick and solved the problem. See the BIT_XOR result for the MD5 given above and the function is able to identify the distinct values in the tables and resulted with the different crc values. The MD5 (Message-Digest algorithm 5) is a well-known cryptographic hash function with a 128-bit resulting hash value. MD5 is widely used in security-related applications, and is also frequently used to check the integrity but MD5() and SHA1() are very CPU-intensive with slower checksumming if chunk-time is included.


Database Collision (computer science) Relational database MySQL

Published at DZone with permission of Peter Zaitsev, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

  • Useful System Table Queries in Relational Databases
  • Fine-Tuning Performance, Resolving Common Issues in FinTech Application With MySQL
  • Optimizing Data Management: Migrating From AWS RDS MySQL to Snowflake
  • Common Mistakes to Avoid When Writing SQL Code

Partner Resources

×

Comments
Oops! Something Went Wrong

The likes didn't load as expected. Please refresh the page and try again.

ABOUT US

  • About DZone
  • Support and feedback
  • Community research
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends:

Likes
There are no likes...yet! 👀
Be the first to like this post!
It looks like you're not logged in.
Sign in to see who liked this post!