The problem

Analyze the Valencia Airbnb market using production-style BI infrastructure: cloud PostgreSQL as the warehouse, a modern BI tool as the front-end. Two stakeholders — a prospective investor and an established property manager — each need a tailored dashboard answering their specific questions.

Stack

  • Warehouse: Supabase-hosted PostgreSQL (Session Pooler connection)
  • BI tool: Preset (managed Apache Superset)
  • Source: Inside Airbnb — 8,847 listings + 416,389 reviews for Valencia
  • Orchestration/ingest: Python + psycopg2 + pandas

Architecture

┌──────────────────┐
 Inside Airbnb    
 CSV.gz downloads 
└──────────┬───────┘
           
┌──────────────────┐    COPY / psycopg2
  Supabase Postgres│◄─────────────────
  apartamentos    
  reviews         
└──────────┬───────┘
                 CREATE TABLE AS SELECT
   ┌──────────────────┐
    apartamentos_clean    (proper types, selected cols)
    reviews_clean     
   └──────────┬───────┘
              
      ┌──────────────────┐
        Preset workspace 
        SQL Lab          
        2 dashboards     
      └──────────────────┘

Key decisions and trade-offs

Raw-to-clean modeled with CREATE TABLE AS

Instead of storing raw Airbnb data and wrangling types at query time in Preset, I materialized clean tables in Postgres via CREATE TABLE AS SELECT. Typed columns, scrubbed sentinels (\N, nan, empty strings), currency stripping from price, and host_is_superhost binarization all happen once at load rather than repeatedly per query.

CREATE TABLE public.apartamentos_clean AS
SELECT
    id::BIGINT AS id,
    neighbourhood_cleansed::VARCHAR(100) AS neighbourhood_cleansed,
    CAST(NULLIF(NULLIF(NULLIF(latitude, '\N'), ''), 'nan') AS FLOAT) AS latitude,
    CASE
        WHEN price IN ('\N', '', 'nan') THEN NULL
        ELSE CAST(REPLACE(REPLACE(price, '$', ''), ',', '') AS NUMERIC(10,2))
    END AS price,
    CASE
        WHEN host_is_superhost IN ('t', 'true', 'True', 'TRUE') THEN TRUE
        ELSE FALSE
    END AS host_is_superhost,
    -- ... 20+ other casts
FROM public.apartamentos;

This is the standard analytics-engineering pattern: push data-quality logic as close to the source as possible, so downstream queries (and BI tool users) get clean data by default.

Business-question framing, not just “SQL practice”

Every analytical query answers a concrete business question:

  • Where are the most expensive listings by property type?
  • How has review volume evolved over time?
  • Which neighborhoods offer the best value-for-money?
  • Where are large-portfolio hosts concentrated?
  • Which neighborhoods grew fastest in the last 3 years?

Each query is paired with a chart-type recommendation and a written analysis — because “what chart to build” is half the AE job.

Two dashboard personas, two different lens

Investor dashboard answers “where to invest?”:

  • Revenue per listing by neighborhood (choropleth map)
  • Review volume and average price over time (time series)
  • Rating vs. review count by neighborhood (scatter)
  • Year-over-year growth in supply (stacked bar)

Property manager dashboard answers “how is my portfolio performing?”:

  • Listing-level detail table with host filters
  • Per-listing control panel (reviews, rating, price)
  • Price vs. rating scatter to surface outliers

Same underlying dataset. Different aggregations. Different questions. Different storytelling.

Notable findings

  • 2020 COVID crater: Review volume cratered mid-2020, recovered through 2021, and now exceeds pre-pandemic levels
  • Superhost paradox: Superhosts charge less on average than non-Superhosts — possibly optimizing for occupancy over premium pricing
  • Fallas effect: March is a consistent review-volume peak across all years — a measurable tourism signal for the local event

Dashboard outputs

The final PDF exports of both dashboards are in the Resultados/ folder alongside the notebook.