
5 Best ETL Practices

Ready to make yourself more efficient? Pick up the 5 best practices for working with ETL (Extract, Transform, Load) for data.


It is not possible to provide a single set of instructions for building the ideal, perfect ETL, not only because of the complexity of the problem, but also because of its diversity: the scope of ETL varies with project requirements. Every part of building the solution (creating the architecture, defining sources and mappings, building validation, applying business rules) is important and has its own specifics.

However, there are some practices that might help you, and I would like to share them with you.

1. Start With Visualization

When you start a new ETL project, you typically meet with the client to determine the list of sources and their types, and then build the mappings. This can be the role of an analyst or an ETL consultant, but often both are involved.

I would suggest creating a visualization of the ETL process before moving on to the actual build. Even in the very early stages of a project, you can take a piece of paper and draw the process (or use professional architecture tools).

Ask yourself these questions:

  • Do I know everything (from an analysis perspective) I should know? If not, make sure the business knows about this gap.
  • Do the sources provide all the information and data required for the target? Do you have a list of sources, dictionaries, and mappings?
  • Do you know the dependencies between systems? Do you know all the relations?
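The dependency question above can even be captured as data rather than just a drawing. The following is a minimal sketch (the step names are hypothetical) of an ETL flow expressed as a dependency graph; writing the dependencies down explicitly makes missing links visible early:

```python
# Minimal sketch of an ETL flow as a dependency graph.
# Step names are hypothetical examples, not from any real project.
etl_steps = {
    "extract_orders": [],
    "extract_customers": [],
    "validate": ["extract_orders", "extract_customers"],
    "transform": ["validate"],
    "load_warehouse": ["transform"],
}

def run_order(steps):
    """Return a valid execution order of the steps (simple Kahn's algorithm)."""
    order, done = [], set()
    while len(done) < len(steps):
        ready = [s for s, deps in steps.items()
                 if s not in done and all(d in done for d in deps)]
        if not ready:
            raise ValueError("cyclic dependency detected")
        for s in sorted(ready):
            order.append(s)
            done.add(s)
    return order

print(run_order(etl_steps))
```

Even this toy version already answers two of the questions above: it lists every source explicitly, and it fails loudly if the dependencies contain a cycle.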

2. Analyze

Even if you have an analyst on your team, you should always analyze the data yourself. Looking for bugs in mappings, relations, or the data itself at the beginning can save you tons of hours of work later. For example, when you get the first extract of source data, check it for NULLs and for primary key and foreign key integrity. The data might not match what the business thinks it contains. This happens very often in medium-sized data systems such as SAP, where you can find a lot of financial exceptions.
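Those three checks (NULLs, primary key uniqueness, foreign key occurrence) are easy to script against a first extract. A hedged sketch in plain Python, with hypothetical table and column names:

```python
# Sanity checks on a freshly extracted table.
# Table and column names below are hypothetical examples.
orders = [
    {"order_id": 1, "customer_id": 10, "amount": 99.5},
    {"order_id": 2, "customer_id": None, "amount": 15.0},
    {"order_id": 2, "customer_id": 11, "amount": 20.0},  # duplicate key!
]
customers = {10, 11}  # known keys from the customer dictionary

# NULL check on a column the business claims is mandatory
nulls = [r for r in orders if r["customer_id"] is None]

# Primary-key uniqueness check
ids = [r["order_id"] for r in orders]
has_duplicates = len(ids) != len(set(ids))

# Foreign-key check against the customer dictionary
orphans = [r for r in orders
           if r["customer_id"] is not None and r["customer_id"] not in customers]

print(len(nulls), has_duplicates, len(orphans))  # → 1 True 0
```

Running something like this on day one, before building any mappings, is exactly the kind of check that surfaces the "financial exceptions" mentioned above while they are still cheap to fix.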

3. Choose the Proper ETL Tool

This is a crucial step from a technical point of view. I suggest you analyze your case and if you have a preferred ETL tool, make use of it!

For example, if the client uses SAP as a source and wants a free ETL tool, then Talend Open Studio might be a very good choice. On the other hand, if the source and destination are both Oracle Database, with an OBIEE reporting system built on top, it might be a good idea to propose ODI as the tool. And lastly, if you have many demands that will require heavy customization, you might think about combining Python and SQL and building your own tool. These are just example cases you might see; every case is different and should be diagnosed carefully.

Nevertheless, I encourage you to analyze your case and possible options. It might save you a lot of time later. 
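To make the "Python and SQL" option concrete, here is a hedged sketch of a minimal homegrown extract-transform-load using Python's standard-library sqlite3 module (the table and column names are made up for illustration; a real project would target your actual database driver):

```python
import sqlite3

# Hypothetical source table with amounts stored in cents.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE src (id INTEGER, amount_cents INTEGER)")
con.executemany("INSERT INTO src VALUES (?, ?)", [(1, 1250), (2, 300)])

# Extract
rows = con.execute("SELECT id, amount_cents FROM src").fetchall()

# Transform: cents -> whole currency units
transformed = [(rid, cents / 100.0) for rid, cents in rows]

# Load
con.execute("CREATE TABLE target (id INTEGER, amount REAL)")
con.executemany("INSERT INTO target VALUES (?, ?)", transformed)

print(con.execute("SELECT SUM(amount) FROM target").fetchone()[0])  # → 15.5
```

The trade-off is the usual one: you gain full control over every step, but you also own all the scheduling, logging, and error handling that a dedicated ETL tool would give you for free.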

4. Scalability

Scalability is not only a very trendy word, but also one of the key features every good ETL should possess. Always think about it and test your solution for it. Even if your ETL currently operates on 10,000 rows daily, it might later grow to 1,000,000 rows or more. Asking the business about the possible growth of data is also a very good idea.

For example, when using web services as a source system, always check how they behave with larger volumes; lack of performance is a common problem there.
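One simple design habit that helps a load step scale from 10,000 rows to 1,000,000 is processing in fixed-size batches instead of materializing the whole extract in memory. A hedged sketch (the batch size of 1,000 is an assumption to tune per system, and the load step is a placeholder):

```python
# Process rows in fixed-size chunks so memory use stays flat
# regardless of how much the source data grows.
def chunked(iterable, size):
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

total = 0
for batch in chunked(range(10_000), 1_000):
    # load_batch(batch)  # placeholder for the real load step
    total += len(batch)

print(total)  # → 10000
```

The same loop works unchanged whether the iterable yields ten thousand rows or ten million, which is exactly the property you want to verify when testing for scalability.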

5. ETL Should Serve People, Not the Other Way Around

Build proper logging, but think of it as helpful, useful information for maintenance, not a painful must-do. Logging that is too big and too complex can make future maintenance harder.

Secondly, when building alerts, always check who should receive them and how they will be delivered. Sending an automatic e-mail to 1,000 people every day is probably not the best option.

And lastly, write proper documentation, and start this task while building the ETL (even just an outline), not two weeks after you finish. Try to write it as simply and as understandably as possible.

Of course, this list is not complete — these are just some ideas. I would be very happy to discuss more with you in the comments.



Published at DZone with permission of Mateusz Komendołowicz, DZone MVB. See the original article here.

