How I wrote a complex SQL query in 12 hours
“Given the plethora of ORMs available, I might never have to write SQL at all”. This was me when I started coding professionally 😅.
Recently, I got to write the most complex SQL query I have ever seen in my life ( It’s in the GitHub gist towards the bottom of the post). The experience was very challenging. I have to admit, I thought about giving up many times while working on this. It was also fun (well… towards the end).
In this post, I am going to explain how I crafted this beast. And should you ever need to (I’m betting that you would), I will attempt to show you how you can too.
For context, Here is the task that required the SQL query.
In working on this challenge, I found that there are at least 3 major steps in writing complex SQL queries (performant or not). Firstly, I needed to understand the data, I needed to connect the data and finally, write the queries.
Understanding the data
This is the first step and a very important one. In my challenge, the data came in an SQL file and I had to run “migration” using NodeJS (I revealed how I did this in another post).
To understand the data, I took advantage of a database GUI tool — TablePlus. (TablePlus is an amazing software for SQL and NoSQL databases; From PostgreSQL, MySQL, SQLite, etc.).
Since I did not design the database myself, this step was crucial.
As instructed in the task, the result of my query should have:
- the District
- the Season
- the Sector
- the Site
- GroupCount (each site has groups of farmers, I would be getting the number per site)
- ClientCount (each site has clients (farmers), they need to be active ones)
- TotalLandSize ( the total size of land per site which should not include the add-ons like street-lights which have their value as -1)
- The result should be ordered alphabetically by sector and then by site
NB: a Client in a Group; a Group is in Site; a Site is in a Sector; and a Sector is in a district.
Connecting the data
It took me about 3 hours or more and 2 iterations to figure out how the data is connected. Knowing how the data is connected informed me on the sequence of queries that I needed to run to be able to get to the final solution.
While going through the data, I used my notebook to take some notes about this sequence of the query.
Here are what I was able to find from this step:
- The id of the season and the real name of
Long rains season 2011
is in theSeasons
table - the
SeasonsClients
contains the SeasonID, clientID and groupID, so it can be connected toSeasons
,Clients
SeasonClientsInputChoices
andGroups
tables. - The
Clients
table is “useless” - The
SeasonClientsInputChoices
table contains ClientID, Acres and SeasonID, so I can get the land sizes per client for the season in view. - The
Groups
table contains SiteID, Active. Since clients belong to groups it can be determined whether a client isactive
. Also, from the SiteID it can be connected to theSites
table. - The
Sites
table contains SiteName and SectorID, so it can be connected toSectors
table for me to get the sector name - The
Sectors
table contains SectorName and DistrictID. (There is only one disrict, it’s districtID is 5). - The
Districts
table contains the DistrictName and DistrictID.
From these connections, I could get the District
, Season
, Sector
, Sites
, GroupCount
, ClientsCount
and TotalLandSize
. That is everything that the final result is required to have.
Writing the query
After determining the connection between the various tables and data, the next step was to attempt the query.
The “pseudo-query”
I started with a “pseudo-query”. It looks something like what we have below: (don’t take it too seriously, it was meant to guide my thought process 😅)
- Select all from SeasonClients where seasonid = 80 (for long rains 2011)
- Select count as groupCount from group by groupid and active = 1
- Select site by siteID (order alphabetically)
- Select sectorname, districtId by sectorId (order alphabetically)
- Select districtName from district by districtID
- Select Clients and select count clients in active groups in the seasonClient table where season = 80 (long rains season 2011)
- Select and sum land sizes in seasonClientInputChoices as totalLandSizes
I used the indentation to show the preceding query that another query would depend on.
Looking at my “pseudo-query” everything looks simplified and easier, right? Don’t get carried away though, the actual query turned out 18 times longer than this!
The actual query
The entire query was made possible by the SQLite with
clause. With the clause, I was able to create multiple Common Table Expressions or CTEs.
CTEs act like temporary views that exist only for the duration of a single SQL statement
This is what my query looked like at the end (Note that this is a single select
statement).
getSeason
CTE
I started with a CTE called getSeason
. getSeason
contains SeasonName, GroupID, SeasonId and ClientID selected from both Seasons
and SeasonClients
tables.
2. getClients
CTE
The getClients
CTE contains the SiteID and all other contents of the getSeason
CTE after running a select statement to determine active groups in getSeason
.
3. getClientCountPerSite
CTE
As the name implies, this one gets the number of client per site, using the result from the getClients
CTE.
4. getGroupCountPerSite
CTE
This one gets the count of groups per site, also using the result from the getClients
CTE.
5. getTotalLandSizePerSite
CTE
This one is obvious, right?
6. getSites
CTE
This is where I put all of the sites-related data together. The sites-related data included the GroupCount, ClientCount, TotalLandSize, SiteName and SiteID. The CTE also contains the SectorID which I used in the next CTE (getSector
).
7. getSector
CTE
To get the SectorName and DistrictID
With the DistrictID, I made the actual select statement against the districts
table to get the DistrictName.
That is it. That’s how I did it!
This is a screen record of the result below.
If you look at the result closely, you’d notice that it took 681 milliseconds to run ( the average is 550ms). This is slow.
I am open to suggestions and contributions on how I could have done this challenge better. Please leave me a comment in the comment section. I’m here to learn too 😀.
Thanks for reading!
Acknowledgement
I got a lot of help from https://www.sqlitetutorial.net/
and https://www.sqlite.org/
. Like I mentioned before, this was a challenge. It was part of One Acre Fund’s recruitment process. I am grateful to have taken part in it.