Useful Tips and Tricks for Data Scientists

In this article, I share some of the tricks and tools I use to interpret data quickly and precisely and to extract useful insights from it.

By Pavel Perfilov · May. 22, 24 · Tutorial

In the world of data science, there are countless libraries and tools that can help speed up your work and make your analyses more efficient. 

As a data analyst and researcher, I have developed tools for processing terabytes of data and performed anomaly research and analysis in the field of brokerage and trading. During my career, I have accumulated some useful knowledge on how to efficiently analyze large amounts of data as well as solve uncommon tasks that arise in the field. 

In this article, I will share some of the tricks and tools I use to interpret data quickly and precisely and to get useful insights from it. I hope you'll find something worth implementing in your own data research.

Useful SQL Functions for Data Analysis

I do a lot of data engineering and dataset preparation in my daily research work, and I've gathered several useful tips that are not widely known but can be very helpful.

Nowadays, data analysts mostly use Python frameworks for data manipulation. However, there are often better and more efficient options.

Sometimes it is good to go "back to school" and use SQL functions instead. Unfortunately, SQL is no longer studied as widely at universities; Pandas has become the default option. Still, SQL has several advantages that can facilitate your work, and young data scientists should get to know and use it, because databases usually offer more varied resources than Python notebooks.

Here are some less popular SQL functions that can be especially helpful during data preparation (rough Pandas equivalents follow the list):

  1. CROSS JOIN: used for operations between rows that might prove cumbersome in Python.
  2. SUM() OVER (PARTITION BY ... ORDER BY ...): a window expression that can be applied without GROUP BY. It divides the result set into partitions based on the values of one or more columns, and calculations and aggregate functions can then be performed within each partition.
  3. ROW_NUMBER(): assigns a unique sequential number to each row, which facilitates sorting, filtering, and other operations; useful for sequential analysis.
  4. COALESCE: used to handle NULL values. It evaluates its arguments in order and returns the first non-null value. If you are doing multiple unions and want a consistently filled column, COALESCE helps with that.
  5. GROUP_CONCAT (MySQL) and STRING_AGG (PostgreSQL, SQL Server): concatenate the string representations of the grouped values and return the result as a single string.
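
If you live mostly in Pandas, the following is a rough sketch of how these SQL constructs map onto DataFrame operations. The data and column names are invented purely for illustration; treat it as an approximate equivalent rather than a drop-in replacement for the SQL versions.

Python
 
import pandas as pd

# Toy data; columns are invented for illustration only.
df = pd.DataFrame({
    "dept":   ["Sales", "Sales", "Accounting", "Sales"],
    "city":   ["Hamburg", None, "Paris", "Tallinn"],
    "backup": ["Berlin", "Madrid", None, None],
    "amount": [100, 200, 150, 50],
})

# SUM() OVER (PARTITION BY dept): per-group sums without collapsing rows.
df["dept_total"] = df.groupby("dept")["amount"].transform("sum")

# ROW_NUMBER() OVER (PARTITION BY dept ORDER BY amount): per-group sequence numbers.
df["row_num"] = df.sort_values("amount").groupby("dept").cumcount() + 1

# COALESCE(city, backup): first non-null value across columns.
df["city_filled"] = df["city"].combine_first(df["backup"])

# GROUP_CONCAT / STRING_AGG: join string values per group.
concat = df.groupby("dept")["city_filled"].agg(lambda s: ";".join(s.dropna()))

# CROSS JOIN: every row paired with every other row.
pairs = df.merge(df, how="cross", suffixes=("_a", "_b"))

print(df)
print(concat)
print(len(pairs))  # 4 x 4 = 16 combinations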

Logs as a Data Source

In my career, I've spent quite a lot of time studying log files in order to spot anomalies, and I'm used to looking at these pieces of information "under a microscope". Logs help you understand in detail what happened at the moment an issue occurred, but how do you extract that information in the most efficient way?

Most of the time, logs are viewed as an unwieldy mass of information in which you have to search for a particular moment in time or find a certain event (or error). Then you have to figure out what exactly went wrong and, based on your findings, fix the error.

But what if you treated a log file as a data science dataset? Looking at it from that angle, you can see that logs can actually be very useful for statistical analysis. Below are some of the things you can do with them and how they help you interpret your data.

  1. Through statistical analysis of the logs, you can get a better understanding of what exactly the app is doing and how often.
  2. You can apply anomaly detection to the delays between events (a minimal sketch follows this list).
  3. Through supervised learning, you may be able to identify whether the app is already experiencing problems. Classification comes in handy here, as it uses an algorithm to assign observations to specific categories.
  4. Another application of supervised learning is predicting what the future performance of the app will be like.
  5. You might be able to model "what if" scenarios, although this can be quite complicated.
  6. Autocorrelation (ACF) functions can be helpful for finding delays and queues.
  7. If you are looking at unknown data, NLP techniques can help interpret it in a more user-friendly form.
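
To make points 2 and 6 concrete, here is a minimal sketch that treats a handful of invented log lines as a dataset, computes the delay between consecutive events of the same type, flags unusually long gaps, and ends with a simple autocorrelation check. The log format and the threshold of three times the median delay are assumptions made up for this example.

Python
 
import pandas as pd

# Invented log lines; a real parser would match your actual log format.
raw = [
    "2024-05-01 10:00:00 INFO order_received",
    "2024-05-01 10:00:01 INFO order_processed",
    "2024-05-01 10:00:05 INFO order_received",
    "2024-05-01 10:00:06 INFO order_processed",
    "2024-05-01 10:00:07 INFO order_received",
    "2024-05-01 10:00:30 INFO order_processed",  # unusually long gap
]

# Treat the log as a dataset: one row per event.
df = pd.DataFrame(
    [line.split(" ", 3) for line in raw],
    columns=["date", "time", "level", "event"],
)
df["ts"] = pd.to_datetime(df["date"] + " " + df["time"])

# What is the app doing, and how many times?
print(df["event"].value_counts())

# Delay between consecutive events of the same type.
df["delay_s"] = df.groupby("event")["ts"].diff().dt.total_seconds()

# Naive anomaly flag: delays far above the typical value (the threshold is arbitrary).
typical = df["delay_s"].median()
df["anomaly"] = df["delay_s"] > 3 * typical
print(df[["ts", "event", "delay_s", "anomaly"]])

# Lag-1 autocorrelation of the delays hints at queuing or periodic behavior.
print(df["delay_s"].dropna().autocorr(lag=1))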

Data Aggregation Without Losing Details

Big data work assumes various aggregation techniques. What's the downside of this approach?

Most of the time, data is normalized in some form and has a flat, or in other words relational, representation. When you perform aggregation, the data is transformed in such a way that some of the fields and attributes are skipped, simply because they are not relevant at the time.

However, when you try to build a hypothesis on the aggregated data through ML or any other statistical analysis tool, you might need to bring the skipped details back. The problem is that you've already lost them from your dataset, so you need to start the cycle all over from the beginning and include the field that now looks important.

This is how you can solve the problem and save a lot of time:

  1. Keep the original dataset available for enrichment.
  2. Perform the “Top-down” analysis, which means joining aggregated data to the original dataset in order to spot anomalies. 

Here are the SQL queries, written for MySQL, ClickHouse, and PostgreSQL, that you can use:

SQL
 
-- create
CREATE TABLE EMPLOYEE (
  empId INTEGER PRIMARY KEY,
  name TEXT NOT NULL,
  dept TEXT NOT NULL,
  city TEXT NULL,
  new_city TEXT NULL
);

-- insert
INSERT INTO EMPLOYEE VALUES (0001, 'Clark', 'Sales', 'Hamburg', NULL);
INSERT INTO EMPLOYEE VALUES (0002, 'Dave', 'Accounting', 'Paris', NULL);
INSERT INTO EMPLOYEE VALUES (0003, 'Ava', 'Sales', 'Tallinn', NULL);
INSERT INTO EMPLOYEE VALUES (0004, 'Ava', 'Sales', 'Paris', NULL);
INSERT INTO EMPLOYEE VALUES (0005, 'Eva', 'Sales', NULL, 'Madrid');

-- query: MySQL
SELECT dept
  , ROW_NUMBER() OVER () AS n
  , COUNT(DISTINCT city) AS uniq_cities
  , COUNT(DISTINCT COALESCE(city, new_city)) AS uniq_cities_
  , GROUP_CONCAT(CONCAT_WS(';', name, dept, city) SEPARATOR '\n') AS all_fields
FROM EMPLOYEE
GROUP BY dept;

-- query: ClickHouse
SELECT dept
  , ROW_NUMBER() OVER () AS n
  , uniq(city) AS uniq_cities
  , uniq(COALESCE(city, new_city)) AS uniq_cities_
  , arrayStringConcat(groupArray(concatWithSeparator(',', toString(tuple(*)))), ';') AS all_fields
FROM EMPLOYEE
GROUP BY dept;

-- query: PostgreSQL
SELECT dept
  , ROW_NUMBER() OVER () AS n
  , COUNT(DISTINCT city) AS uniq_cities
  , COUNT(DISTINCT COALESCE(city, new_city)) AS uniq_cities_
  , json_agg(ROW_TO_JSON(EMPLOYEE)) AS all_fields
FROM EMPLOYEE
GROUP BY dept;


Alternatively, if you are using Pandas, run this code:

Python
 
import pandas as pd

data = [
    (1, "Clark", "Sales", "Hamburg"),
    (2, "Dave", "Accounting", "Riga"),
    (3, "Ava", "Sales", "Tallinn"),
    (4, "Ava", "Sales", "Paris"),
    (5, "Eva", "Sales", "Paris"),
]

df = pd.DataFrame(data, columns=["empId", "name", "dept", "city"])

# Aggregate (distinct city count per dept) while keeping the original rows
# attached per group, so the details behind each aggregate stay available.
result = df.groupby(["dept"]).agg(uniq_cities=("city", "nunique")).join(
    df.groupby(["dept"]).apply(lambda x: x.to_dict("records")).rename("all_fields")
)
print(result)


This allows you to unfold the data again and check the details behind your aggregates.

Effective Visualization for Data Discovery

When I think about how to improve the visibility of a certain process and describe it from a statistical analysis perspective, I always come to the conclusion that visualization is a way of representing vectors in compressed form that can reflect multiple dimensions.

All statistics books and articles say that the very first thing you should do with the data is take a sample and draw a scatter plot or some other chart. But what if your dataset is very large and contains 100 or more columns? In that case, it is quite hard to show as many dimensions as possible while keeping the chart informative and relevant.

Use a clear, easily understandable representation so that the discussion is about the actual data problems being shown rather than about what means what. Remember that if you're trying to tackle a complicated problem, you will always end up with a combination of plots. Follow these steps to make your data easier to understand in visual form:

  1. Show what the dataset looks like so that there is an understanding of “what we are looking at”
  2. Demonstrate your data
  3. Write a comment

Over the years, I have singled out several visualization options that I find very helpful for data analysis. These are the Python charts that have helped me save time and reach preliminary conclusions (a minimal sketch of the first and last options follows the list).

  • Scatter plot with coloring and sizing of different categories
  • A heatmap or a facet grid of heatmaps 
  • ECDF and Q-Q plots, which can be used for data distribution discovery and for checking the 'normality' of factors or comparing pairs of factors
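
As an illustration of the first and last options, here is a minimal matplotlib sketch on synthetic data: a scatter plot with color and size encoding per category, plus an ECDF of one factor for a quick look at its distribution. The dataset, column choices, and styling are invented for the example.

Python
 
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Synthetic dataset: two numeric factors plus a category and a size measure.
n = 300
category = rng.choice(["A", "B", "C"], size=n)
x = rng.normal(loc=0.0, scale=1.0, size=n) + (category == "B") * 1.5
y = rng.normal(loc=0.0, scale=1.0, size=n)
size = rng.uniform(10, 200, size=n)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Scatter plot with coloring and sizing of different categories.
colors = {"A": "tab:blue", "B": "tab:orange", "C": "tab:green"}
for cat, color in colors.items():
    mask = category == cat
    ax1.scatter(x[mask], y[mask], s=size[mask], c=color, alpha=0.5, label=cat)
ax1.set_title("Scatter with color and size encoding")
ax1.legend()

# ECDF: a quick view of the distribution without picking histogram bins.
xs = np.sort(x)
ecdf = np.arange(1, n + 1) / n
ax2.step(xs, ecdf, where="post")
ax2.set_title("ECDF of x")

plt.tight_layout()
plt.show()

The same idea extends to a facet grid of heatmaps: one small, consistently scaled panel per category so the eye can compare groups at a glance.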

Opinions expressed by DZone contributors are their own.
