
a new beginning

My first day

I started my professional career working as a Data Analyst at a medium-sized fintech company. My first day on the job was October 28th, 2019, and boy, do I remember that day vividly.

My first day consisted of getting to the office and being introduced to the whole company (~250 people) in an All-Hands meeting held in an auditorium. I also got to go on a rather quick tour around the office along with the other new hires, and to pick up my welcome kit. But there was no better feeling than being walked to my desk and finding her.

She was beautiful, dressed in red, quiet and… old? Don’t get me wrong, being introduced to my soon-to-be partner in crime was exciting. But once I realized I would be working on a Celeron laptop that was already five years old at the time, I figured I’d better get the most out of it.

So I logged in… … … It took more than one full minute to get to the desktop. Ok, nothing to be too worried about, on to the next thing: I pressed Ctrl + Alt + Del, but nothing happened.

This baby was taking its sweet ol’ time to start responding, and after a good minute, it finally came alive. Task Manager was in front of me, and it was as if Pitbull had started singing his 2009 hit “I Know You Want Me”. This baby was rocking four, yes, you read that right, 4 gigs of RAM.

The data role

Knowing that I would be working my first job on a somewhat slow laptop was a bit demotivating, but hey, first job, so I figured that was normal.

Being a month or two away from graduating as an Industrial Engineer, I was really excited to start working and earning some money for myself. I also wanted to learn a thing or two in this job, you know, things that are usually not taught in college.

Being totally honest, I was not formally educated in “programming” per se, but I had taken a couple of Java courses and a single database design course, in which I mainly learned SQL and a bit of C# through .NET.

I really enjoyed that database course, mainly because it felt like I was actually doing something through the computer. I had always been intrigued by the visual side of things too, designing websites and some Android applications.

So, my Data Analyst skills were primarily focused on the analysis itself rather than on “knowing” how to pull the data for that analysis. I figured I could do that via SQL, but beyond that there was nothing else I could do data-wise, or so I thought.

During my first few weeks at the job I was provided with some Excel files that had data in them. So, naturally, I started asking questions about the data to generate a report/analysis. I quickly figured it would be easier if I looked at the data directly myself, since I couldn’t reliably tell if the provided information was a subset of something bigger or if there were more fields that could help my research.

This implied getting access to some database I knew nothing about, so I started asking around about who could provide me with those credentials.

Tribal knowledge

After a few weeks of asking around and being redirected to other departments and colleagues, I ended up getting access to a MySQL database. At that moment in time I wasn’t aware of the existence of different flavors of SQL, but nothing to be worried about since it’s basically the same language.

I had only ever interacted with a database via SSMS, so naively enough I downloaded it onto my beauty of a laptop. I tried connecting without luck for around three days until I finally figured out there was a native MySQL editor, downloaded it, and had a chance to start working on extracting and interacting with data.

The first thing to do was to look for the table that was used to generate the Excel file I already had – again, way too naive. After spending a good two to three hours querying tables, trying to find which keys could be used to join them and which columns were used to generate the file, I decided to ask for documentation.

Maybe I should’ve asked for that earlier, but I really wanted to have a small grasp of the database to start asking questions.
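
If you ever find yourself exploring an undocumented MySQL database the same way, the information_schema views are a decent place to start. A rough sketch of the kind of query I mean (the schema name and the column pattern here are made up for illustration):

-- List every table/column in a schema, guessing at names that might match
-- the headers of the Excel file (schema name and LIKE pattern are hypothetical).
SELECT
  table_name,
  column_name,
  data_type
FROM
  information_schema.columns
WHERE
  table_schema = 'some_schema'
  AND column_name LIKE '%amount%'
ORDER BY
  table_name,
  ordinal_position;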

I was told there was no documentation since things were “pretty easy to follow”.

I promptly understood that this was going to be a very interesting way of working and learning. I wasn’t too discouraged to continue, but I knew it was going to be challenging to reconcile information, since I was pretty sure that all of the database changes, updates, definitions, etc… were tribal knowledge.

Manual processes everywhere

Ok! Now that we had sorted out access to the database and confirmed that there was no documentation to rely on, we could continue extracting the information that created the Excel files.

The first thing was to ask something I should’ve asked about a week earlier, but, well, I now had access to the database, so, uh, progress. The question: who generated those files? I was pointed to a colleague whom we will refer to as Kate.

Kate had been at the company for about 3 years (half of the company’s lifetime), and in that time she had created the majority – if not all – of the reports consumed by the finance, collections, and legal departments. She also worked closely with the CEO, since she had built some quite complex reports that had a very positive impact throughout the company, so Kate was trusted and very relevant within the company.

I reached out to Kate to learn how these files were created. To my surprise, Kate was very helpful and shared with me the query that generated each file. She also shared some additional information, such as which tables were the most used and how certain fields were defined (because some of them were the result of applying business rules); she even told me who was in charge of updating the tables.

Wait, there was a person responsible for updating tables in a production database? Yup, you read that right, again. A person was tasked with updating a production database on a daily basis. Well, not my problem at this point, since all I wanted to do was figure out how to extract information to generate some Excel files.

SQL deep diving

Back to Kate’s files. The queries Kate shared with me were actually pretty clean: they had a few comments here and there and, at the very least, they were formatted. There was only one thing that bothered me about these files: they were about 400 lines long each (about 5 files in total).

The Excel files had about 10 columns each, so roughly 40 lines of SQL per column. That was quite a lot, but I knew close to nothing about the database, so I figured this was the standard (a little bit scary, but nothing we cannot overcome).

Now that I had the queries, I had a chance to learn a bit more about the database schema, tables, and fields.

I noticed a recurring pattern within the queries: they heavily relied on nested CTEs and subqueries to filter some tables. Here’s an example of what the queries were like:

WITH cte_1 AS (
  SELECT
    table_a_1.column_a,
    table_a_1.column_b,
    ...
  FROM
    table_i_barely_know_anything_about AS table_a_1
    -- join the table against an aggregated copy of itself to keep
    -- only the row with the highest column_c per column_a
    INNER JOIN (
      SELECT
        table_a_2.column_a,
        MAX(table_a_2.column_c) AS column_c
      FROM
        table_i_barely_know_anything_about AS table_a_2
      GROUP BY column_a
    ) AS table_a_filtered ON
      table_a_1.column_a = table_a_filtered.column_a AND
      table_a_1.column_c = table_a_filtered.column_c
)
SELECT
  cte_1.column_a,
  cte_1.column_b,
  cte_1.column_c,
  SUM(cte_1.column_d) AS column_d
FROM
  cte_1
GROUP BY 1, 2, 3;

In short, the queries were pulling information from table “a” and creating both a CTE and a subquery essentially just to filter table “a”. I mean, it’s not wrong; I really didn’t know whether these tables were partitioned in some way, or whether certain fields behaved in some hidden way that impacted how we needed to write the queries, etc…

No, that was not it. It was done like this because it seemed… right? At least that’s what Kate told me. So, I took it upon myself to dig into these queries and try to improve them, you know, learn by doing.

Small changes for the win

I began working on improving these queries, starting by creating diagrams and workflows, and documenting the business rules and logic that resulted in the creation of these Excel files.

Previously I shared a code example that was basically filtering a table in two steps; the natural thing was to simplify that:

WITH cte_1 AS (
  SELECT
    column_a,
    column_b,
    column_c,
    column_d,
    -- rank rows so that the one with the highest column_c per column_a
    -- comes first, replicating the MAX() + self-join filter from before
    ROW_NUMBER() OVER (PARTITION BY column_a ORDER BY column_c DESC) AS column_rn
  FROM
    table_i_barely_know_anything_about
)
SELECT
  cte_1.column_a,
  cte_1.column_b,
  cte_1.column_c,
  SUM(cte_1.column_d) AS column_d
FROM
  cte_1
WHERE
  column_rn = 1
GROUP BY 1, 2, 3;

That’s a bit better, right? Well, at least it cut the files down to about half of their original length. So now I had to deal with ~200-line files.

I continued my refactoring and improvements by standardizing column aliases, removing unnecessary duplicate queries, and extending existing CTEs to provide more information with as few reads as possible. Along with the refactoring I created documentation, not aiming to explain every single line of each query but focused on what information it tried to extract, what subsets of data we were interested in, etc… you get the point.
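
To give an idea of what “extending a CTE instead of duplicating a query” meant in practice, here is a simplified sketch (the table and column names are placeholders, as before): one base CTE exposes every column the downstream aggregates need, so the table is read once instead of being scanned by two near-identical queries.

-- One pass over the table; every downstream aggregate reads from the same CTE
-- instead of each one defining its own copy of this SELECT.
WITH base AS (
  SELECT
    column_a,
    column_c,
    column_d
  FROM
    table_i_barely_know_anything_about
)
SELECT
  column_a,
  COUNT(*)      AS row_count,
  MAX(column_c) AS latest_column_c,
  SUM(column_d) AS total_column_d
FROM
  base
GROUP BY
  column_a;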

I was trying to set standards in a world where standards were not the norm. And boy, I was in for a ride.