Oracle vs Teradata vs Hadoop for Big Data
This article focuses on large and very large data warehouses, though small ones are briefly mentioned for a fuller picture.
It is written for people who need speed more than anything else, i.e. systems built around brute-force full scans (a scenario that strains Oracle people and makes Teradata people happy :)).
Tooling will also be touched upon.
Let's examine which data volumes and needs Oracle, Hadoop, NoSQL, and Teradata each fit.
1) For small volumes, NoSQL is the way to go if you begrudge the $900 for Oracle SE One.
The main advantage is price: NoSQL databases are usually free.
A small volume of data also means the data model and the development are not very complex.
2) For medium and large volumes, Oracle has many advantages over Teradata and Hadoop.
The main ones are:
1. Maturity of the technology and the product, and the number of implementations, compared to Teradata and Hadoop. Hadoop is young; Teradata has been on the market for 30 years but has relatively few implementations.
2. A very rich set of tools and facilities that makes development easier and faster.
3. Price, compared to Teradata.
4. Good scalability and speed, but with one bottleneck: the storage subsystem, which is shared by all compute servers.
So up to a certain limit Oracle offers some of the best data-processing speed.
Oracle's shortcoming:
The fastest self-built storage array I have seen delivers 18 GB/s.
An Exadata Full Rack provides 25 GB/s (without compression, storage cell, and cache optimizations) and costs around $5M.
Teradata provides 34 GB/s (without compression optimizations) and also costs around $5M.
A cluster of 100 commodity servers provides 40 GB/s and costs around $250,000.
But often a plain full scan in Oracle is not enough.
Take Beeline: in 2007 a single table received 170 million rows per day — every call made in Russia.
Reading the whole table for analysis is impossible; you will never have enough disk throughput.
In that case the standard optimization is to build a few aggregates over the big table, around 4 million rows per day each,
and build the reports from those aggregates.
Such an optimization can be done in Oracle, Teradata, or Hadoop.
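The aggregate approach can be sketched in a few lines of Python; the schema below is purely hypothetical, not Beeline's real tables:

```python
from collections import defaultdict

# Hypothetical call-detail records: (day, region, duration_sec).
calls = [
    ("2007-01-01", "Moscow", 120),
    ("2007-01-01", "Moscow", 45),
    ("2007-01-01", "Kazan", 300),
    ("2007-01-02", "Moscow", 60),
]

# Roll the huge fact table up into a much smaller daily aggregate keyed by
# (day, region); reports then scan the aggregate instead of the fact table.
agg = defaultdict(lambda: {"calls": 0, "total_sec": 0})
for day, region, dur in calls:
    agg[(day, region)]["calls"] += 1
    agg[(day, region)]["total_sec"] += dur

print(agg[("2007-01-01", "Moscow")])  # {'calls': 2, 'total_sec': 165}
```

In SQL this is a single GROUP BY over the fact table, typically refreshed once a day by the ETL process.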
This optimization has three shortcomings:
1. If business users need a new field in a report, you have to carry that field from the fact table through the aggregate to the report. That takes a long time to develop.
2. If business users need an answer right now, it is not possible; by the time development is done, the question may no longer matter.
3. Complicated ETL.
To address these three shortcomings, Beeline decided to move some tasks (especially ad-hoc ones) to Hadoop.
3) For extremely large data you can use Hadoop.
Its advantages:
1. Almost endless scalability: you can get 25, 125, or 1000 GB/s.
2. Price: everything is free :) except the hardware, of course.
Its disadvantages:
1. Writing MapReduce is harder, especially if you already have trained SQL people, and it gets harder the more complex the logic is. So ad-hoc queries will not be as simple as in SQL.
2. Commodity servers consume more power and occupy more space.
3. Hadoop on commodity servers needs the data stored with extra redundancy, more than on other platforms.
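To illustrate why even a trivial aggregation takes more code in MapReduce than in SQL, here is a minimal pure-Python sketch of the map, shuffle, and reduce phases (this mimics the paradigm, not Hadoop's actual Java API):

```python
from itertools import groupby
from operator import itemgetter

# Input records: (region, duration_sec).  The SQL equivalent is one line:
#   SELECT region, SUM(duration_sec) FROM calls GROUP BY region;
records = [("Moscow", 120), ("Kazan", 300), ("Moscow", 45)]

def mapper(rec):
    region, dur = rec
    yield (region, dur)          # emit key/value pairs

def reducer(key, values):
    yield (key, sum(values))     # aggregate all values for one key

# Shuffle phase: sort mapper output and group it by key, as Hadoop does
# between the map and reduce stages.
mapped = sorted(kv for rec in records for kv in mapper(rec))
result = dict(
    out
    for key, group in groupby(mapped, key=itemgetter(0))
    for out in reducer(key, (v for _, v in group))
)
print(result)  # {'Kazan': 300, 'Moscow': 165}
```

Even this toy version needs explicit map, shuffle, and reduce steps; with joins or multi-stage logic, the gap to one-line SQL only grows.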
4) For extremely large data there is Teradata.
Teradata handles brute-force work with data, such as full scans, much better than Oracle.
Teradata's ideology is shared-nothing, which resembles Hadoop: data is spread across servers, and each server computes over its own part.
Manual sharding in Teradata is possible and in fact necessary.
But Teradata has significant drawbacks:
1. Quite poor tools; it is not comfortable to work with compared to Oracle. Oracle is a mature product, while Teradata still has some growing pains.
2. Price: a Teradata Full Rack and an Exadata Full Rack cost about the same, around $5M.
A point common to Teradata and Hadoop (HBase) is also worth mentioning: data has to be sharded across nodes, and the shard key must spread the data evenly over all of them.
For example, region is a bad candidate for Beeline: the Moscow region alone holds 20% of the data.
A benefit of Teradata is that it effectively offers four levels of partitioning, whereas Oracle offers two. With 170 million rows per day, three or four partition levels are very valuable.
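The even-distribution point can be shown with a toy hash-sharding sketch; the 20% Moscow skew comes from the article, while the node count, row counts, and key names are made up for illustration:

```python
import zlib
from collections import Counter

NODES = 4

def node_for(key: str) -> int:
    # Stable hash so placement is deterministic (Python's hash() is salted).
    return zlib.crc32(key.encode()) % NODES

# Hypothetical 100,000 rows; Moscow alone holds 20% of them.
rows_per_region = {"Moscow": 20_000}
rows_per_region.update({f"region-{i:02d}": 1_000 for i in range(80)})

# Sharding by region: every Moscow row lands on the same node, so that
# node carries at least 20% of all data by itself, whatever the hash does.
load = Counter()
for region, n in rows_per_region.items():
    load[node_for(region)] += n
print(load[node_for("Moscow")] / sum(load.values()))  # at least 0.2

# Sharding by a high-cardinality key (e.g. subscriber id) spreads the same
# 100,000 rows almost evenly, close to 25,000 per node.
load_by_id = Counter(node_for(f"subscriber-{i}") for i in range(100_000))
```

In a shared-nothing system the slowest (most loaded) node gates every query, so a skewed shard key throws away much of the cluster's parallelism.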
Teradata's limit:
With its shared-nothing architecture and the BYNET V5 network, Teradata scales up to 2,048 nodes at 76 TB (10K-rpm disks) per node, 324 PB in total.
One Exadata Full Rack has a more modest limit: at most 672 TB (7.2K-rpm) or 200 TB (15K-rpm). Scaling Exadata out is not very profitable, because the disk array is shared by all servers!
If you join two Exadata machines (is that possible? theoretically, yes), everything is held back by the 40 Gbit network bandwidth: the first rack has fast access to its own disks
but slower access to the second rack's disks, and vice versa.
We should also note that Teradata and Exadata offer columnar and hybrid compression, typically 4-6x. NoSQL databases have compression too,
though perhaps not as advanced as in these monsters, which have spent a lot of money on it.
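As a rough illustration of why columnar layouts compress well, here is a toy experiment with zlib standing in for the real engines (Exadata's Hybrid Columnar Compression and Teradata's block-level compression are far more sophisticated, and the data here is synthetic):

```python
import random
import zlib

random.seed(0)
regions = ["Moscow", "Kazan", "Omsk", "Tver"]

# 10,000 synthetic rows of (region, amount).
rows = [(random.choice(regions), random.randint(0, 999))
        for _ in range(10_000)]

# Column layout: the low-cardinality region column is stored contiguously,
# handing the compressor long runs of highly repetitive data.
col_layout = (",".join(r for r, _ in rows) + "|" +
              ",".join(str(a) for _, a in rows)).encode()

raw, packed = len(col_layout), len(zlib.compress(col_layout))
print(f"compression ratio: {raw / packed:.1f}x")
```

Even generic zlib compresses this column-oriented layout several-fold; engines that add dictionary and run-length encoding per column do considerably better.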
To complete the picture:
1) Oracle has three caches: memory on the Storage Cells, SSD PCI flash cards, and the main server's memory.
Teradata has one cache, the nodes' memory, but it has its own know-how: temperature-based (hot/cold) data storage!
Given this, and considering MVCC, Oracle is much better suited to OLTP load.
2) Exadata has smart Storage Cells that can filter data before sending it to the main server, filtering at the OS and HDD-firmware level.
Oracle has the richest tools, so development will be fastest.
Teradata also has a powerful SQL dialect, with a few shortcomings, but a poor procedural language and poor tools.
Hadoop is a collection of a very large number of tools, and I will not describe them all. But one fact tells: Hadoop was designed for batch processing with
long latency and without joins.
Roughly speaking, the full-scan speed ratio looks like this:
So decide what you need: fast full scans or a flexible product.
Teradata and Oracle can be compared to a train and an icebreaker. Teradata is a train: it rides fast, but only on rails; you cannot turn it left or right.
Oracle is slower on big data but can go anywhere: you can turn it any way, it will break any ice, and it has thousands of features and optimizations.
Hadoop is like a rocket: it can be very fast, but only for very specific tasks, on a very specific trajectory.
If you have no ad-hoc queries, all queries are known in advance, and the data is under 600 TB, buy Oracle or Exadata.
If you have more data, it makes sense to consider Teradata or Hadoop, depending on whether you need joins and SQL.
If you have ad-hoc queries and the business needs answers ASAP, use a very big Hadoop or Teradata cluster.
Four racks of Teradata cost $80M.
I also wanted to cover the combination of Oracle + Lustre, but realized it adds nothing new for Oracle: we hit the same 40 Gbit network bandwidth bottleneck.
There are also newer Teradata-like databases: Vertica, Aster Data, Greenplum, Netezza. I think my next article will compare all of these.