The problem

Analyze 75 years of Formula 1 history — drivers, races, constructors, circuits, lap times. The raw data is 13 CSV files with primary/foreign-key relationships. Answer a mix of general competitive questions plus a focused career study of driver Fernando Alonso.

What makes this AE-relevant

Despite being a pandas project (not SQL), the work is fundamentally relational:

  • 13 source tables with PK/FK relationships
  • Multi-way joins (driver × race × results × constructors × circuits)
  • Head-to-head analytical tables built via stack/unstack — equivalent to pivoting in SQL
  • Filtering and aggregation across joined result sets

The same analytical work would slot directly into a SQL warehouse. The pandas form is a stepping stone — if this project were “real,” the next move would be loading the CSVs into Postgres and rewriting the analyses as CTEs.

Representative analyses

  • All-time top 5 winners / pole-sitters / fastest-lappers
  • Average positions gained per race (filtered by a minimum number of race completions — a classic pre-aggregation filter that removes small-sample noise)
  • Most-frequent DNF causes via DataFrame × status-code lookup
  • Head-to-head: Fernando Alonso vs. rivals — each row a race, each column a rival, each cell their position
  • Circuit-level speed ranking via groupby-then-apply (top-100 fastest laps per circuit)
  • Folium map of all circuit locations
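The head-to-head table is the most SQL-flavored of these. A minimal sketch of the stack/unstack pivot, using illustrative column names (`raceId`, `driverRef`, `position`) and toy data rather than the actual ergast schema:

```python
import pandas as pd

# Hypothetical slice of the joined results table: one row per
# (race, driver) pair with that driver's finishing position.
results = pd.DataFrame({
    "raceId":    [1, 1, 2, 2, 3, 3],
    "driverRef": ["alonso", "hamilton", "alonso", "hamilton", "alonso", "hamilton"],
    "position":  [2, 1, 1, 3, 4, 2],
})

# unstack pivots the long table wide: rows become races, columns become
# drivers, cells hold finishing positions -- the pandas analogue of PIVOT.
h2h = results.set_index(["raceId", "driverRef"])["position"].unstack("driverRef")
print(h2h)
```

Each row is now directly comparable across rivals, which is what makes the head-to-head framing work.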

Key decisions

Multi-table joins in pandas need explicit index handling

When merging with pd.merge, it matters which index values carry through each step. Being explicit about left_on / right_on / right_index=True up front avoids surprises at the end of the query — the same philosophy as writing explicit JOIN predicates in SQL.
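A small sketch of what "explicit" means here, with toy stand-ins for two of the source tables (column names are illustrative, not the exact ergast schema):

```python
import pandas as pd

# Toy results table (FK: driverId) and drivers table keyed by its index.
results = pd.DataFrame({"raceId": [1, 2], "driverId": [10, 10], "points": [25, 18]})
drivers = pd.DataFrame({"driverRef": ["alonso"]},
                       index=pd.Index([10], name="driverId"))

# Spell out the join predicate: results.driverId against the drivers index.
# This is the pandas equivalent of ON results.driverId = drivers.driverId.
merged = results.merge(drivers, left_on="driverId", right_index=True, how="left")
print(merged)
```

Relying on positional defaults instead would still run, but the join key would be implicit and easy to break when a table's index changes upstream.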

Filter before aggregating

df_results[df_results['statusId'] == 1] before groupby is cheaper than groupby-then-filter. Same optimization principle as pushing WHERE before GROUP BY.
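A minimal sketch of the pattern, assuming statusId == 1 means "Finished" in the status lookup (toy data, not the real tables):

```python
import pandas as pd

# Minimal stand-in for the results table.
df_results = pd.DataFrame({
    "driverId": [10, 10, 11, 11],
    "statusId": [1, 3, 1, 1],
    "points":   [25, 0, 18, 15],
})

# Filter first, aggregate second -- WHERE before GROUP BY: the groupby
# only ever touches rows that satisfy the predicate.
finished = df_results[df_results["statusId"] == 1]
points_by_driver = finished.groupby("driverId")["points"].sum()
print(points_by_driver)
```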

Method chaining over intermediate variables

Several of the more complex aggregations (top-5 pole positions, average positions gained) are written as a single fluent chain. More readable than a stack of intermediate DataFrames, and mirrors how the same query would look in SQL.
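As an illustration of the style (not the project's actual code — the qualifying table and column names here are assumed), a top-5 pole-positions count written as one chain:

```python
import pandas as pd

# Illustrative qualifying table; grid == 1 marks a pole position.
qualifying = pd.DataFrame({
    "driverRef": ["alonso", "alonso", "hamilton", "vettel", "hamilton"],
    "grid":      [1, 1, 1, 2, 1],
})

# One fluent chain, reading top to bottom like the equivalent SQL:
# WHERE -> GROUP BY -> ORDER BY -> LIMIT.
top_poles = (
    qualifying
    .loc[lambda d: d["grid"] == 1]   # keep only pole positions
    .groupby("driverRef")
    .size()                          # poles per driver
    .sort_values(ascending=False)
    .head(5)
)
print(top_poles)
```

The lambda-based .loc filter is what lets the chain stay unbroken — no intermediate variable is needed to reference the filtered frame.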

Stack

  • Data: 13 CSVs from ergast.com, ~20MB on disk
  • Analytics: pandas, matplotlib, Folium
  • Visualization: horizontal bar charts, pie charts, KDE, geographic map