
Column Compression (With an Optional Predefined Dictionary)


This blog demonstrates a test scenario for column compression using a predefined dictionary, something for the QA engineers and bug hunters out there.


This blog discusses column compression with an optional predefined dictionary.

Compression, more compression with different algorithms, compress again, compress multiple times! Compression is a hot topic in our lives.

In general, testing new things is great if the processes are well-described and easy to follow. Let’s try to think like a QA engineer: the first golden rule of QA is “everything is buggy, life is full of bugs: good night bugs, good morning bugs, hello my old bug friends.”

The second golden rule of QA is “OK, now let’s find a way to catch a bug — but remember that your methods can be buggy, too.”

Remember: Always test! No bugs, no happiness!

When you start to test, the first goal is getting an idea of what is going on. This blog will demonstrate a test scenario for column compression with an optional predefined dictionary. For reference on column compression, read "Compressed columns with dictionaries."

To begin, let’s set up a basic environment. First, install Percona Server; the process is already documented here: PS 5.6 installation.

Second, find an already existing test: xtradb_compressed_columns_ibd_sizes.test.

Third, write a simple script to get started:

import mysql.connector

# Connect to the sandbox instance (adjust credentials and port to your setup).
cnx = mysql.connector.connect(user='msandbox', password='msandbox',
                              host='127.0.0.1',
                              database='dbtest',
                              port=22896,
                              autocommit=True)
cursor = cnx.cursor()

# Create a compression dictionary seeded with the frequently repeated value.
crt_comp_dic = "CREATE COMPRESSION_DICTIONARY names2 ('Bartholomew')"
cursor.execute(crt_comp_dic)

# t1: plain column; t2: compressed column; t3: compressed column using the
# predefined dictionary created above.
table_t1 = "CREATE TABLE t1(id INT, a BLOB) ENGINE=InnoDB"
table_t2 = "CREATE TABLE t2(id INT, a BLOB COLUMN_FORMAT COMPRESSED) ENGINE=InnoDB"
table_t3 = ("CREATE TABLE t3(id INT, a BLOB COLUMN_FORMAT COMPRESSED "
            "WITH COMPRESSION_DICTIONARY names2) ENGINE=InnoDB")
cursor.execute(table_t1)
cursor.execute(table_t2)
cursor.execute(table_t3)

# Insert the same highly repetitive data into all three tables.
insert_stmt = "INSERT INTO {} VALUES({}, REPEAT('Bartholomew', 128))"
for i in range(100000):
    for table in ('t1', 't2', 't3'):
        cursor.execute(insert_stmt.format(table, i))
        print(insert_stmt.format(table, i))

cursor.close()
cnx.close()


As you might notice, column compression can be used with or without a compression dictionary. The visible difference, of course, is in the size of the tables. If you want to compress columns based on a predefined dictionary, you should create it from frequently used data. It is possible to create an empty dictionary, but it will have no effect. (See here: #1628231.)
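The intuition behind predefined dictionaries can be demonstrated with zlib's preset-dictionary feature from the Python standard library: if the compressor already knows the frequent byte sequences, it can emit back-references instead of literal bytes from the very first occurrence. This is a minimal sketch of the mechanism, not Percona Server's actual implementation; the 'Bartholomew' dictionary mirrors the one created in the script above:

```python
import zlib

data = b"Bartholomew" * 128   # the same repetitive payload the script inserts
dictionary = b"Bartholomew"   # the "predefined dictionary"

# Compress without a preset dictionary.
plain = zlib.compress(data)

# Compress with the preset dictionary: the first occurrence of the value can
# already be encoded as a back-reference into the dictionary.
comp = zlib.compressobj(zdict=dictionary)
with_dict = comp.compress(data) + comp.flush()

# Decompression must use the same dictionary.
decomp = zlib.decompressobj(zdict=dictionary)
assert decomp.decompress(with_dict) == data

print(len(data), len(plain), len(with_dict))
```

With the dictionary, the stream is no larger and typically a few bytes smaller on this toy payload; on short real-world values that share common substrings, the savings are much larger, which is why an empty dictionary has no effect.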

The result of running this script is three tables with 100,000 rows each:

t1 -> uncompressed
t2 -> compressed column
t3 -> compressed column with the predefined compression dictionary ('names2')

Table size difference:

sh@sh-ubuntu:~/sandboxes/rsandbox_percona-server-5_6_31/master/data/dbtest$ ls -lth | grep .ibd
-rw-rw---- 1 sh sh 168M Sep 29 23:43 t1.ibd
-rw-rw---- 1 sh sh  15M Sep 29 23:43 t2.ibd
-rw-rw---- 1 sh sh  14M Sep 29 23:43 t3.ibd
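A quick sanity check on those numbers, using the sizes in MiB from the ls output above:

```python
# .ibd file sizes in MiB, taken from the listing above
sizes_mib = {"t1": 168, "t2": 15, "t3": 14}

# How much smaller the compressed tables are relative to the uncompressed one.
ratio_t2 = sizes_mib["t1"] / sizes_mib["t2"]   # compressed column
ratio_t3 = sizes_mib["t1"] / sizes_mib["t3"]   # compressed column + dictionary

print(f"t2 is {ratio_t2:.1f}x smaller, t3 is {ratio_t3:.1f}x smaller")
# t2 is 11.2x smaller, t3 is 12.0x smaller
```

The dictionary buys only a modest extra saving here because the payload is a single repeated value; with more varied data sharing common substrings, the gap between t2 and t3 would be wider.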


The resulting sizes:

sh@sh-ubuntu:~/sandboxes/rsandbox_percona-server-5_6_31/master/data/dbtest$ ls -lh | grep .ibd
-rw-rw---- 1 sh sh 160M Sep 29 23:52 t1.ibd
-rw-rw---- 1 sh sh 8.0M Sep 29 23:52 t2.ibd
-rw-rw---- 1 sh sh 7.0M Sep 29 23:52 t3.ibd


Using ROW_FORMAT=COMPRESSED requires innodb_file_format to be newer than Antelope (i.e., Barracuda). This restriction does not apply to COLUMN_FORMAT COMPRESSED.

Again, check the size:

sh@sh-ubuntu:~/sandboxes/rsandbox_percona-server-5_6_31/master/data/dbtest$ ls -lh | grep .ibd
-rw-rw---- 1 sh sh  76M Sep 29 23:57 t1.ibd
-rw-rw---- 1 sh sh 4.0M Sep 29 23:58 t2.ibd
-rw-rw---- 1 sh sh 4.0M Sep 29 23:58 t3.ibd


Question: How do I get information about compression dictionaries and the tables that use them? Answer: query the information_schema tables:

master [localhost] {msandbox} ((none)) > SELECT * FROM information_schema.xtradb_zip_dict;
+----+--------+-------------+
| id | name   | zip_dict    |
+----+--------+-------------+
|  1 | names  | Bartholomew |
|  2 | names2 | Bartholomew |
+----+--------+-------------+
2 rows in set (0.00 sec)
master [localhost] {msandbox} ((none)) > SELECT * FROM information_schema.xtradb_zip_dict_cols;
+----------+------------+---------+
| table_id | column_pos | dict_id |
+----------+------------+---------+
|       67 |          1 |       1 |
+----------+------------+---------+
1 row in set (0.00 sec)



Topics: qa, columns, database, compression

Published at DZone with permission of Shahriyar Rzayev, DZone MVB. See the original article here.
