Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Column Compression (With an Optional Predefined Dictionary)

DZone's Guide to

Column Compression (With an Optional Predefined Dictionary)

This blog will demonstrate a test scenario for column compression with the helping hand of a predefined dictionary — something for the QA and bug hunters out there.

· Database Zone
Free Resource

Learn how to move from MongoDB to Couchbase Server for consistent high performance in distributed environments at any scale.

column compression

This blog discusses column compression with an optional predefined dictionary.

Compression, more compression with different algorithms, compress again, compress multiple times! Compression is a hot topic in our lives.

In general, testing new things is great if the processes are well-described and easy to follow. Let’s try to think like a QA engineer: the first golden rule of QA is “everything is buggy, life is full of bugs: good night bugs, good morning bugs, hello my old bug friends.”

The second golden rule of QA is “OK, now let’s find a way to catch a bug — but remember that your methods can be buggy, too.”

Remember: Always test! No bugs, no happiness!

When you start to test, the first goal is getting an idea of what is going on. This blog will demonstrate a test scenario for column compression with an optional predefined dictionary. For reference on column compression, read Compressed columns with dictionaries.”

To begin, let’s set up a basic environment:
The installation process requires installing Percona Server which is already documented here -> PS 5.6 installation

Secondly, find an already existing test: xtradb_compressed_columns_ibd_sizes.test.

Third, write a simple script to get started:

import mysql.connector
cnx = mysql.connector.connect(user='msandbox', password='msandbox',
                              host='127.0.0.1',
                              database='dbtest',
                              port=22896,
                              autocommit=True)
cursor = cnx.cursor()
crt_comp_dic = "CREATE COMPRESSION_DICTIONARY names2 ('Bartholomew')"
cursor.execute(crt_comp_dic)
table_t1 = "CREATE TABLE t1(id INT,a BLOB) ENGINE=InnoDB"
table_t2 = "CREATE TABLE t2(id INT,a BLOB COLUMN_FORMAT COMPRESSED) ENGINE=InnoDB"
table_t3 = "CREATE TABLE t3(id INT,a BLOB COLUMN_FORMAT COMPRESSED WITH COMPRESSION_DICTIONARY names) ENGINE=InnoDB"
cursor.execute(table_t1);
cursor.execute(table_t2);
cursor.execute(table_t3);
insert_stmt = "insert into {} values({},repeat('Bartholomew', 128))"
for i in range(0, 100000):
	cursor.execute(insert_stmt.format('t1', int(i)))
	print insert_stmt.format('t1', int(i))
	cursor.execute(insert_stmt.format('t2', int(i)))
	print insert_stmt.format('t2', int(i))
	cursor.execute(insert_stmt.format('t3', int(i)))
	print insert_stmt.format('t3', int(i))
cursor.close()
cnx.close()


As you might notice, column compression might be with or without a compression dictionary. The visible difference, of course, is in the size of the tables. If you want to compress columns based on a predefined dictionary, you should create it with frequently used data. It is possible to create an empty dictionary, but it will have no effect. (See here: #1628231.)

The result of running this script is:

100.000 rows tables

t1 -> uncompressedt2 -> compressed column, t3 -> compressed column with compression dictionary, ‘names2’ dictionary

t2 -> compressed column,

t3 -> compressed column with compression dictionary, ‘names2’ dictionary.

Table size difference:

sh@sh-ubuntu:~/sandboxes/rsandbox_percona-server-5_6_31/master/data/dbtest$ ls -lth | grep .ibd
-rw-rw---- 1 sh sh 168M Sep 29 23:43 t1.ibd
-rw-rw---- 1 sh sh  15M Sep 29 23:43 t2.ibd
-rw-rw---- 1 sh sh  14M Sep 29 23:43 t3.ibd


The resulted size:

sh@sh-ubuntu:~/sandboxes/rsandbox_percona-server-5_6_31/master/data/dbtest$ ls -lh | grep .ibd
-rw-rw---- 1 sh sh 160M Sep 29 23:52 t1.ibd
-rw-rw---- 1 sh sh 8.0M Sep 29 23:52 t2.ibd
-rw-rw---- 1 sh sh 7.0M Sep 29 23:52 t3.ibd


Using  ROW_FORMAT=COMPRESSED requires innodb_file_format to be >  Antelope. But this is not true for COLUMN_FORMAT.

Again, check the size:

sh@sh-ubuntu:~/sandboxes/rsandbox_percona-server-5_6_31/master/data/dbtest$ ls -lh | grep .ibd
-rw-rw---- 1 sh sh  76M Sep 29 23:57 t1.ibd
-rw-rw---- 1 sh sh 4.0M Sep 29 23:58 t2.ibd
-rw-rw---- 1 sh sh 4.0M Sep 29 23:58 t3.ibd


Question: How do I get information about column compression dictionary and tables? Answer: tables from information_schema:

master [localhost] {msandbox} ((none)) > SELECT * FROM information_schema.xtradb_zip_dict;
+----+--------+-------------+
| id | name   | zip_dict    |
+----+--------+-------------+
|  1 | names  | Bartholomew |
|  2 | names2 | Bartholomew |
+----+--------+-------------+
2 rows in set (0.00 sec)
master [localhost] {msandbox} ((none)) > SELECT * FROM information_schema.xtradb_zip_dict_cols;
+----------+------------+---------+
| table_id | column_pos | dict_id |
+----------+------------+---------+
|       67 |          1 |       1 |
+----------+------------+---------+
1 row in set (0.00 sec)


Want to deliver a whole new level of customer experience? Learn how to make your move from MongoDB to Couchbase Server.

Topics:
qa ,columns ,database ,compression

Published at DZone with permission of Shahriyar Rzayev, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

THE DZONE NEWSLETTER

Dev Resources & Solutions Straight to Your Inbox

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.

X

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}