DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Please enter at least three characters to search
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Zones

Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

Modernize your data layer. Learn how to design cloud-native database architectures to meet the evolving demands of AI and GenAI workkloads.

Secure your stack and shape the future! Help dev teams across the globe navigate their software supply chain security challenges.

Releasing software shouldn't be stressful or risky. Learn how to leverage progressive delivery techniques to ensure safer deployments.

Avoid machine learning mistakes and boost model performance! Discover key ML patterns, anti-patterns, data strategies, and more.

Related

  • How to Restore a Transaction Log Backup in SQL Server
  • How to Attach SQL Database Without a Transaction Log File
  • A Deep Dive into Apache Doris Indexes
  • Spring Boot Sample Application Part 1: Introduction and Configuration

Trending

  • Unlocking AI Coding Assistants Part 4: Generate Spring Boot Application
  • A Developer's Guide to Mastering Agentic AI: From Theory to Practice
  • Top Book Picks for Site Reliability Engineers
  • Breaking Bottlenecks: Applying the Theory of Constraints to Software Development
  1. DZone
  2. Data Engineering
  3. Databases
  4. How To Use SQL to Directly Query Files

How To Use SQL to Directly Query Files

In this article, we'll take a look at a few tools you can use to quickly and conveniently extract insights from your .csv, .tsv, and .json files, and much more!

By 
Harshil Patel user avatar
Harshil Patel
·
Updated Feb. 11, 22 · Tutorial
Likes (6)
Comment
Save
Tweet
Share
9.0K Views

Join the DZone community and get the full member experience.

Join For Free

Article Image

Have you ever been sent a file and asked to find important information buried within it? Your coworkers would be very impressed if you could query the files in a quick and efficient manner. But... how exactly are you going to achieve such a feat?

As you probably know, SQL allows you to modify database data quickly and easily. When trying to work with data files, developers usually load data into a database and manage it via an SQL client like Arctype. In a perfect world, you could just query the database to get the information your company needs. 

But in real life, there’s a catch: data loading is often not straightforward.AaIt would be extremely convenient if SQL queries could be run directly on files, skipping the database setup step. It turns out, other people have noticed this fact and have built tools to query your files directly with SQL. In this article, we'll look at a few tools and list some different SQL file query scenarios that they can handle. We'll also compare the tools and share some general considerations. Let's begin.

TextQL

TextQL might be a suitable fit for you if you're looking for a simple tool to query .csv or .tsv files. TextqQL allows you to execute SQL on structured text with ease. It also allows you to use quote-escaped delimiters. When running a query, TextQL can automatically detect numeric and datetime data in appropriate formats, which helps you make your work easier. TextQL lets you list as many files and folders as you want. You can load the files or directories you're working with by listing them at the end of the command.

Installation

Let's look at a few ways to install TextQL. One of the simplest methods you can use to get started is to use Homebrew. Just run the command below:

 
brew install textql


If you're running Docker, you can also install TextQL with a single short command:

 
docker build -t textql .


Lastly, if you're using Linux, you may be familiar with the AUR tool. AUR installation is also straightforward:

 
yaourt textql-git


Now that we've gotten TextQL set up and working, let's see what it can do.

Example

We're going to examine an example to understand how TextQL works. Assume we have the following data:

 
cat sample_data.csv
Id,name,value,timestamp
1, Jack,5,1643004723 
1, John,11,1643114723 
1, James,-3,1645596723


Now, excluding the header, let's count the number of data items we have.

 
textql -header -sql "select count() from sample_data" sample_data.csv

#output
3


You should see an output of "3," which is exactly what we'd expect. We can also use TextQL to find the maximum value of this data. We can do that by running the command below:

 
textql -header -sql "select max(value) from sample_data" sample_data

#output
11


Again, we see the expected value of 11. Very nice!

q

You might consider using q to query your files if you are working with .csv or .tsv files and need something that works faster than TextQL. q aims to bring SQL's expressive capability to the Linux command line by offering direct access to multi-file SQLite3 databases and simple access to the text as actual data.

q uses the SQLite engine. It allows you to run SQL-like statements directly on tabular text data, with the data being auto-cached to speed up subsequent queries on the same file. Using q, SQL statements can be run directly on multi-file SQLite3 databases without requiring them to be merged or loaded into memory.

Installation

Like TextQL, it's easy to install q via homebrew. Simply run the command below:

 
brew install harelba/q/q


q also has a standalone executable, which you can download from this link. If you want to use the Windows installer, you can run the executable and follow the prompts displayed on the screen. Full installation instructions, as well as releases for each platform, can be found here.

You can also install q as an .rpm package. You can download the package from this link, then install it via the following commands:

 
rpm -ivh <package-filename>
Or
rpm -U <package-filename>


Example

To get an idea of how q works, let's run a query on a file where columns are named a1, a2, ... aN.

 
q -H "select a1,a5 from file.csv"


Now, we'll COUNT DISTINCT values in a specific field. In this case, let's try to see how many UUIDs we have with the following command:

 
q -H -t "SELECT COUNT(DISTINCT(uuid)) FROM ./file.csv"


The output of this command will be a numeric value, equal to the number of UUIDs in the file.

OctoSQL

OctoSQL is primarily a command-line application that allows you to query a variety of databases and filetypes using SQL in a single interface, as well as perform JOINS between them. OctoSQL is a fully expandable, fully-featured dataflow engine that can be used to provide a SQL interface for your applications. It validates and optimizes queries based on database types. It may process massive volumes of data and return partial results before completing the complete query. Out of the box, OctoSQL only supports .csv and JSON files. You'll need to install a plugin to query other file formats, like Excel or Parquet files.

Installation

You can run the following command to install OctoSQL with Homebrew.

 
brew install cube2222/octosql/octosql


Installation with Go is also supported, as shown below:

 
go install -u github.com/cube2222/octosql


As mentioned, OctoSQL has a robust plugin capability. You can install plugins using commands like the one shown below:

 
octosql plugin install postgres


Example

Let's have a look at a simple example of invoice data. Suppose we have .csv of invoice data in the format shown below:

 
octosql "SELECT * FROM ./invoice2.csv"

cust_id,cust_name,value
121, Jack,599.00
122, John,1100.00
123, James,400.50


Suppose we want to calculate a sum of the invoice values. In that case, we could use the following command to find the answer:

 
#Sum
octosql "SELECT id, SUM(value) as value_sum FROM ./invoices2.csv GROUP BY id ORDER BY value sum DESC”


Similarly, we can find a count of all the entries using this command:

 
#Count
octosql "SELECT COUNT(*) FROM mydb.customer”


We can also perform more advanced operations like joins. The command below will get us what we want:

 
octosql "SELECT inv_id, value, email FROM ./invoices.csv JOIN mydb.customers ON customer_id = customer_id”


OctoSQL may be the ideal fit for you if you require a tool that can work with a variety of file formats. One drawback that few people consider is that OctoSQL uses a custom engine instead of SQLite, which indicates that it may be missing some features.

DSQ

DSQ is a command-line tool that lets you execute SQL queries. It supports a wide range of file formats, including .csv, JSON, .tsv, Excel, Parquet, and .ods.

Installation

To install DSQ on a Mac or Linux operating system, use the following command, visit the releases page and download the release you want. Then simply unzip the download and add DSQ to your $PATH. You can follow the same steps for Windows.

You can also install DSQ in Go using the following command:

 
$ go install github.com/multiprocessio/dsq@latest


Example:

DSQ allows you to either stream data or provides a file name to work with. The example below shows how you can use either a .json or .ndjson format, depending on your requirements.

 
$ dsq testdata.json "SELECT * FROM {} WHERE x > 10"
#or
$ dsq testdata.ndjson "SELECT name, AVG(time) FROM {} GROUP BY name ORDER BY AVG(time) DESC"


You can also use DSQ to work with data from different origin types. The example below connects a .csv dataset and a .json dataset.

 
$ dsq testdata/join/users.csv testdata/join/ages.json \
      "select {0}.name, {1}.age from {0} join {1} on {0}.id = {1}.id"


If you need a tool that can handle a wide range of file formats and uses SQLite, this is the tool for you.

Cases for SQL File Queries

To see these tools in action, let's look at a few SQL file query examples. We'll be using .txt or .csv files for these examples, since almost all of the tools we discussed above support this file format.

Filtering via SQL

Filtering allows you to view only the information you wish to see. Filters are useful for displaying only the desired records from a form or report. Using a filter, you may limit the data in a view without affecting the architecture of the underlying object.

For example, let's start by selecting records from a text file that contains the information below:

CLASS NAME ENGLISH HISTORY MATH
1 James Kim 78 65 67
1 John White 87 61 79
2 Paige Davis 77 82 94
3 Edwin Henderson 65 78 94

Suppose we want to choose the students in class 1 in the above file. The file's first row provides column names, whereas the rest of the rows have comprehensive information that we can use to query the records we want. So, we could run the following command to get the information we want:

 
$select * from E:/txt/Students_file.txt where CLASS = 1


In SQL, you may sort data in ascending or descending order by one or more columns. To try this out, let's sort the table of student scores by class in ascending order and total score in descending order. We can do that with the following command:

 
$select * from E:/txt/Students_file.txt order by CLASS,ENGLISH+HISTORY+MATH desc


You can combine results using aggregation by grouping records depending on their value. Grouping can also be used to calculate the sum of many values in a group. The grouped aggregates give a summary of a set of rows. For example, you can use the command below to find the lowest English score, the highest History score, and the overall Math score.

 
$select CLASS,min(English),max(History),sum(Math) from E:/txt/students_file.txt group by CLASS


As you can see, the ability to perform SQL queries directly on files can help us answer important questions from our data.

Conclusion

In conclusion, let's use the chart below to take a quick look at all of the tools we mentioned before.

NAME SUPPORTED FILES ENGINE
q .csv, .tsv SQLite
TextQL .csv, .tsv SQLite
OctoQL .csv, .json, Excel, Parquet Custom
DSQ .csv, .tsv, .json, Parquet, Excel, logs SQLite

Without having to go through a database, you can use the tools above to execute SQL on your files for instant insights. Each of these tools has different pros and cons, so be sure to choose the option that best suits your needs. Happy querying.

Database sql

Opinions expressed by DZone contributors are their own.

Related

  • How to Restore a Transaction Log Backup in SQL Server
  • How to Attach SQL Database Without a Transaction Log File
  • A Deep Dive into Apache Doris Indexes
  • Spring Boot Sample Application Part 1: Introduction and Configuration

Partner Resources

×

Comments
Oops! Something Went Wrong

The likes didn't load as expected. Please refresh the page and try again.

ABOUT US

  • About DZone
  • Support and feedback
  • Community research
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends:

Likes
There are no likes...yet! 👀
Be the first to like this post!
It looks like you're not logged in.
Sign in to see who liked this post!