How I wrote a complex SQL query in 12 hours

Theo Okafor
5 min readMar 19, 2020

--

complex SQL (credit: complex.com and i0.wp.com)

“Given the plethora of ORMs available, I might never have to write SQL at all”. This was me when I started coding professionally 😅.

Recently, I got to write the most complex SQL query I have ever seen in my life ( It’s in the GitHub gist towards the bottom of the post). The experience was very challenging. I have to admit, I thought about giving up many times while working on this. It was also fun (well… towards the end).

In this post, I am going to explain how I crafted this beast. And should you ever need to (I’m betting that you would), I will attempt to show you how you can too.

For context, Here is the task that required the SQL query.

The Task (Credit: Codescreen and One Acre Fund)

In working on this challenge, I found that there are at least 3 major steps in writing complex SQL queries (performant or not). Firstly, I needed to understand the data, I needed to connect the data and finally, write the queries.

Understanding the data

This is the first step and a very important one. In my challenge, the data came in an SQL file and I had to run “migration” using NodeJS (I revealed how I did this in another post).

To understand the data, I took advantage of a database GUI tool — TablePlus. (TablePlus is an amazing software for SQL and NoSQL databases; From PostgreSQL, MySQL, SQLite, etc.).

My database on TablePlus with Long Rains Season 2011 highlighted

Since I did not design the database myself, this step was crucial.

As instructed in the task, the result of my query should have:

  1. the District
  2. the Season
  3. the Sector
  4. the Site
  5. GroupCount (each site has groups of farmers, I would be getting the number per site)
  6. ClientCount (each site has clients (farmers), they need to be active ones)
  7. TotalLandSize ( the total size of land per site which should not include the add-ons like street-lights which have their value as -1)
  8. The result should be ordered alphabetically by sector and then by site

NB: a Client in a Group; a Group is in Site; a Site is in a Sector; and a Sector is in a district.

Connecting the data

It took me about 3 hours or more and 2 iterations to figure out how the data is connected. Knowing how the data is connected informed me on the sequence of queries that I needed to run to be able to get to the final solution.

While going through the data, I used my notebook to take some notes about this sequence of the query.

Here are what I was able to find from this step:

  1. The id of the season and the real name of Long rains season 2011 is in the Seasons table
  2. the SeasonsClients contains the SeasonID, clientID and groupID, so it can be connected to Seasons, Clients SeasonClientsInputChoices and Groups tables.
  3. The Clients table is “useless”
  4. The SeasonClientsInputChoices table contains ClientID, Acres and SeasonID, so I can get the land sizes per client for the season in view.
  5. The Groups table contains SiteID, Active. Since clients belong to groups it can be determined whether a client is active. Also, from the SiteID it can be connected to the Sites table.
  6. The Sites table contains SiteName and SectorID, so it can be connected to Sectors table for me to get the sector name
  7. The Sectors table contains SectorName and DistrictID. (There is only one disrict, it’s districtID is 5).
  8. The Districts table contains the DistrictName and DistrictID.

From these connections, I could get the District, Season, Sector, Sites, GroupCount, ClientsCount and TotalLandSize. That is everything that the final result is required to have.

Writing the query

After determining the connection between the various tables and data, the next step was to attempt the query.

The “pseudo-query”

I started with a “pseudo-query”. It looks something like what we have below: (don’t take it too seriously, it was meant to guide my thought process 😅)

- Select all from SeasonClients where seasonid = 80 (for long rains 2011)
- Select count as groupCount from group by groupid and active = 1
- Select site by siteID (order alphabetically)
- Select sectorname, districtId by sectorId (order alphabetically)
- Select districtName from district by districtID
- Select Clients and select count clients in active groups in the seasonClient table where season = 80 (long rains season 2011)
- Select and sum land sizes in seasonClientInputChoices as totalLandSizes

I used the indentation to show the preceding query that another query would depend on.

Looking at my “pseudo-query” everything looks simplified and easier, right? Don’t get carried away though, the actual query turned out 18 times longer than this!

The actual query

The entire query was made possible by the SQLite with clause. With the clause, I was able to create multiple Common Table Expressions or CTEs.

CTEs act like temporary views that exist only for the duration of a single SQL statement

This is what my query looked like at the end (Note that this is a single select statement).

  1. getSeason CTE

I started with a CTE called getSeason. getSeason contains SeasonName, GroupID, SeasonId and ClientID selected from both Seasons and SeasonClients tables.

2. getClients CTE

The getClients CTE contains the SiteID and all other contents of the getSeason CTE after running a select statement to determine active groups in getSeason.

3. getClientCountPerSite CTE

As the name implies, this one gets the number of client per site, using the result from the getClients CTE.

4. getGroupCountPerSite CTE

This one gets the count of groups per site, also using the result from the getClients CTE.

5. getTotalLandSizePerSite CTE

This one is obvious, right?

6. getSites CTE

This is where I put all of the sites-related data together. The sites-related data included the GroupCount, ClientCount, TotalLandSize, SiteName and SiteID. The CTE also contains the SectorID which I used in the next CTE (getSector).

7. getSector CTE

To get the SectorName and DistrictID

With the DistrictID, I made the actual select statement against the districts table to get the DistrictName.

That is it. That’s how I did it!

This is a screen record of the result below.

credit: me!

If you look at the result closely, you’d notice that it took 681 milliseconds to run ( the average is 550ms). This is slow.

I am open to suggestions and contributions on how I could have done this challenge better. Please leave me a comment in the comment section. I’m here to learn too 😀.

Thanks for reading!

Acknowledgement

I got a lot of help from https://www.sqlitetutorial.net/ and https://www.sqlite.org/. Like I mentioned before, this was a challenge. It was part of One Acre Fund’s recruitment process. I am grateful to have taken part in it.

--

--

Theo Okafor
Theo Okafor

Written by Theo Okafor

Software Engineer. Loves to play Coldplay on repeat while coding JavaScript and Python apps.

No responses yet