top of page

Exploring COVID-19 Data with SQLite: Cases, Deaths, and Vaccination Insights

  • Writer: Oluwakemi Oyefeso
    Oluwakemi Oyefeso
  • Jan 16
  • 8 min read

Updated: Jan 28

Exploring COVID-19 Data with SQLite: Cases, Deaths, and Vaccination Insights

In 2022, I completed this analysis as a guided project using Microsoft SQL. I decided to revisit and improve the project by implementing it in SQLite. Since I couldn't upload Excel files directly to SQLite, I first converted the data into CSV format, and then uploaded the files into SQLite for querying.


CSV Files uploaded top SQLite
Percentage of Total Deaths vs. Reported Cases in the UK Table

Exploring the Datasets


1. Total Deaths vs. Reported Cases:

To explore how deaths relate to reported cases by location, I calculated the percentage of deaths relative to the total cases reported per location.


SQL Query:


SELECT 
    location, 
    date, 
    total_cases, 
    total_deaths, 
    ROUND((total_deaths / NULLIF(total_cases, 0)) * 100, 2) AS PercentageDeaths--NULLIF prevents division by zero
FROM CovidDeaths
ORDER BY location, date

Result: The result shows the daily death percentage for each location.

Total Deaths vs. Reported Cases Table
Percentage of Total Deaths vs. Reported Cases Table

For more specific insights, we can apply a WHERE clause to filter for a particular country, like Nigeria:


SELECT 
    location, 
    date, 
    total_cases, 
    total_deaths, 
    ROUND((total_deaths / NULLIF(total_cases, 0)) * 100, 2) AS PercentageDeaths -- NULLIF prevents division by zero
FROM CovidDeaths
WHERE location = 'Nigeria'
ORDER BY date DESC -- From the latest to the oldest

Result: As of September 9, 2020, Nigeria had reported 55,632 total cases and 1,070 deaths, which is 1.92% of the total cases. Further investigation shows that Nigeria's highest death percentage was 3.31% on April 21, 2020.

Percentage of Total Deaths vs. Reported Cases In Nigeria
Percentage of Total Deaths vs. Reported Cases In Nigeria

For comparison, I analyzed countries like the United States, the United Kingdom, and the United Arab Emirates:

SELECT 
    location, 
    date, 
    total_cases, 
    total_deaths, 
    ROUND((total_deaths / NULLIF(total_cases, 0)) * 100, 2) AS PercentageDeaths -- NULLIF prevents division by zero
FROM CovidDeaths
WHERE location LIKE '%united%' -- Filters locations containing the word 'united'
ORDER BY location, date DESC

Result:

The UAE has a lower fatality rate compared to the UK and US.


Percentage of Total Deaths vs. Reported Cases in the UAE, UK & USA Table
Percentage of Total Deaths vs. Reported Cases in the UAE, UK & USA Table

Percentage of Total Deaths vs. Reported Cases in the UAE, UK & USA Table

Fatality was higher in the UK in 2020 and begins to reduce towards the end of 2020

Percentage of Total Deaths vs. Reported Cases in the UK Table
Percentage of Total Deaths vs. Reported Cases in the UK Table

The US recorded a 3% death rate by September 8, 2020, which grew to 6% in May 2020.

Percentage of Total Deaths vs. Reported Cases in the US Table

Percentage of Total Deaths vs. Reported Cases in the US Table


2. Total Cases per Population:

Next, I examined how the total cases related to the population for various countries:

SELECT 
    location, 
    date, 
    total_cases, 
    population, 
    ROUND((total_cases / NULLIF(population, 0)) * 100, 2) AS PercentageCases -- NULLIF prevents division by zero
FROM CovidDeaths
WHERE location LIKE '%states%' -- Filters locations containing the word 'states'
ORDER BY location, date DESC

Result:

  • Early in 2020, the reported cases in the US were below 1% of the population.


Percentage Total Cases per Population in the US Table.
Percentage Total Cases per Population in the US Table
  • By late 2020 and 2021, the numbers began to increase significantly.

Percentage Total Cases per Population in the US Table

3. Highest Infection Rate vs. Population:

This query identifies locations with the highest percentage of reported cases relative to their population:


SELECT 
    location, 
    population, 
    MAX(total_cases) AS HighestCasesReported, 
    ROUND(MAX(total_cases / population) * 100, 2) AS PercentageOfPopulationInfected 
FROM CovidDeaths
GROUP BY location, population
ORDER BY PercentageOfPopulationInfected DESC

Result:

  • Andorra had the highest case percentage at 17.13% of its population, followed by Montenegro at 15.51%. This table was downloaded for visualization in Power Bi.

Percentage Highest Infection Rate vs. Population Table
Percentage Highest Infection Rate vs. Population Table

For more specific insights, we can apply a WHERE clause to filter for a particular country, like Nigeria.


SELECT 
    location, 
    population, 
    MAX(total_cases) AS HighestCasesReported, 
    ROUND(MAX(total_cases / population) * 100, 2) AS PercentageOfPopulationInfected 
FROM CovidDeaths
WHERE location = 'Nigeria'
GROUP BY location, population
ORDER BY PercentageOfPopulationInfected DESC

This shows that the 'HighestCases' reported in Nigeria is 165,110 which was 0.08% of the population.

Percentage Highest Infection Rate vs. Population in Nigeria Table
Percentage Highest Infection Rate vs. Population in Nigeria Table

For the And Top 10 Countries with the highest cases, Teh SQL Query is:


SELECT 
    location, 
    population, 
    MAX(total_cases) AS HighestCasesReported, 
    ROUND(MAX(total_cases / population) * 100, 2) AS PercentageOfPopulationInfected 
FROM CovidDeaths
GROUP BY location, population
ORDER BY PercentageOfPopulationInfected DESC
Limit 10

Result:

The US, and Isreal are also part of the Top 10 countries with the highest number of reported cases.


Top 10 Countries with the Highest Infection Rate vs. Population Table Percentage Table
Top 10 Countries with the Highest Infection Rate vs. Population Table Percentage Table

Also added 'date' to the query:


SELECT 
    location, 
    population, 
    date,
    MAX(total_cases) AS HighestCasesReported, 
    ROUND(MAX(total_cases / population) * 100, 2) AS PercentageOfPopulationInfected 
FROM CovidDeaths
GROUP BY location, population,date
ORDER BY PercentageOfPopulationInfected DESC
Highest Count of Total Deaths Recorded

Result:

This table was downloaded as CSV file for further visualization.

Highest Infection Rate vs. Population Table Percentage Table
Highest Infection Rate vs. Population Table Percentage Table

4. Highest Death Counts:

To explore the locations with the highest death tolls, I used the following query:

SELECT 
    location,  
    MAX(total_deaths) AS HighestDeathCounts 
FROM CovidDeaths
GROUP BY location
ORDER BY HighestDeathCounts DESC

Result:

The result shows awkward numbers.


Highest death counts table1

After identifying that the 'total_deaths' column was stored as a TEXT format,


total_deaths column data type

I cast it to an integer to obtain correct results:


SELECT 
    location,  
    MAX(CAST(total_deaths AS int)) AS HighestDeathCounts 
FROM CovidDeaths
GROUP BY location
ORDER BY HighestDeathCounts DESC

Result:

  • The table includes results for "World",


Highest Death Counts Table
  • Looking through the 'coviddeaths' table again reveals that some locations lacked continent values (i.e value = NULL), which skewed the results as shown below:


SELECT* FROM CovidDeaths
ORDER BY continent
CovidDeaths table

  • To correct this, I added "WHERE continent IS NOT NULL" to my query to adjust for the NULL values:


SELECT 
    location,  
    MAX(CAST(total_deaths AS int)) AS HighestDeathCounts 
FROM CovidDeaths
WHERE continent is not null
GROUP BY location
ORDER BY HighestDeathCounts DESC

Result:

United States has the highest number of deaths.

Death counts per location

Deaths per continent:

I reversed the query by returning values where continent IS NULL to get the count of deaths per continent:


SELECT 
    location,  
    MAX(CAST(total_deaths AS int)) AS HighestDeathCounts 
FROM CovidDeaths
WHERE continent IS null
GROUP BY location
ORDER BY HighestDeathCounts DESC
Deaths per continent table

To remove 'World', 'European Union', and 'International' from the table, I adjusted the condition to exclude them:

SELECT 
    location,  
    MAX(CAST(total_deaths AS int)) AS HighestDeathCounts 
FROM CovidDeaths
WHERE continent IS null
AND location NOT IN ('World','European Union','International')
GROUP BY location
ORDER BY HighestDeathCounts DESC

Result:


Shows only the Highest Death Counts per continent and this was downloaded as csv file for visualization in Power Bi.

Highest Death Counts per continent Table
Highest Death Counts per continent Table

Global Numbers:

To calculate the total new cases and deaths worldwide, I aggregated the data:

SELECT  
    date, 
    SUM(new_cases) AS Tot_New_Cases, 
    SUM(CAST(new_deaths AS INT)) AS Tot_New_Deaths,
    ROUND((SUM(CAST(new_deaths AS INT)) / NULLIF(SUM(new_cases), 0)) * 100, 2) AS Percentage_New_Deaths 
FROM CovidDeaths
WHERE continent IS NOT NULL
GROUP BY date
Order By date

Result:

The result is in an unordered date sequence. I needed the date to be sequential. I discovered that the date is stored as 'text' type and in MM/DD/YYYY format.

 total new cases vs deaths worldwide

Explanation:

  • To handle date sorting, I used "substr" to reformat the date into a standard "YYYY-MM-DD" format and cast it as a DATE type for sequential ordering.


SELECT 
    date, 
    SUM(new_cases) AS Tot_New_Cases, 
    SUM(CAST(new_deaths AS INT)) AS Tot_New_Deaths,
    ROUND((SUM(CAST(new_deaths AS INT)) / NULLIF(SUM(new_cases), 0)) * 100, 2) AS Percentage_New_Deaths
FROM CovidDeaths
WHERE continent IS NOT NULL AND date IS NOT NULL
GROUP BY date
ORDER BY CAST(substr(date, 7, 4) || '-' || substr(date, 1, 2) || '-' || substr(date, 4, 2) AS DATE) ASC;

Result:

Percentage of Total Deaths Vs Total New Cases Table
Percentage of Total Deaths Vs Total New Cases Table

So, these are the daily new cases, deaths and percentage deaths across the world.


Explanation.

Since the date is in the MM/DD/YYYY format:

  1. substr(date, 7, 4): Starts at the 7th character in the date string (including the "/ " delimiter) and extracts 4 characters starting from the 7th position which is the Year part i.e. 01/11/2020 will extract 2020.

  2. substr(date, 1, 2): Starts at the 1st character in the date string and extracts 2 characters starting from the 1st position which is the Month part i.e. 01/11/2020 will extract 01.

  3. substr(date, 4, 2): Starts at the 4th character in the date string and extracts 2 characters starting from the 4th position which is the Day part i.e. 01/11/2020 will extract 11

  4. || '-' ||: Concatenates the extracted parts into the YYYY-MM-DD format. That's why the Order By statement starts which the year, followed month and then day.

  5. CAST(... AS DATE): Converts the resulting string into a DATE data type.


Total Deaths Worldwide Query:

SELECT  
    SUM(new_cases) AS Tot_New_Cases, 
    SUM(CAST(new_deaths AS INT)) AS Tot_New_Deaths,
    ROUND((SUM(CAST(new_deaths AS INT)) / NULLIF(SUM(new_cases), 0)) * 100, 2) AS Percentage_New_Deaths 
FROM CovidDeaths
WHERE location = 'World'

Result:

3.18 million deaths worldwide, a 2.11% of new cases reported in the world

Worldwide Percentage of Total Deaths Vs Total New Cases Table
Worldwide Percentage of Total Deaths Vs Total New Cases Table

5. Total Population vs. Vaccinated:

Finally, I explored vaccination data by joining the CovidDeaths and CovidVaccinations tables:


SELECT 
    d.continent,
    d.location,
    d.date,
    d.population,
    v.new_vaccinations
FROM CovidDeaths AS d
JOIN CovidVaccinations  AS v
ON d.date = v.date
AND d.location = v.location
WHERE d.continent IS NOT NULL
ORDER BY d.location, CAST(substr(d.date, 7, 4) || '-' || substr(d.date, 1, 2) || '-' || substr(d.date, 4, 2) AS DATE) ASC

Result:

Vaccinations started in Albania on January 13, 2021, with 60 vaccinations recorded.

Total Population vs. Vaccinated in Albania Table
Total Population vs. Vaccinated in Albania Table

6. Running Total of Vaccinations:

I calculated the running total of vaccinations for each location, ensuring the total incremented by the daily vaccinations:


SELECT 
    d.continent,
    d.location,
    d.date,
    d.population,
    v.new_vaccinations,
    SUM(CAST(v.new_vaccinations AS INT)) 
        OVER (PARTITION BY d.location ORDER BY CAST(substr(d.date, 7, 4) || '-' || substr(d.date, 1, 2) || '-' || substr(d.date, 4, 2) AS DATE)
              ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_vaccinations
FROM CovidDeaths AS d
JOIN CovidVaccinations AS v
    ON d.date = v.date
    AND d.location = v.location
WHERE d.continent IS NOT NULL
ORDER BY d.location, CAST(substr(d.date, 7, 4) || '-' || substr(d.date, 1, 2) || '-' || substr(d.date, 4, 2) AS DATE) ASC;

Explanation:

  • The running total resets when a new location appears in the dataset.

  • The query partitions the data by location and orders by date to calculate the cumulative sum.


Result:


Running Total of Vaccinations Table
Running Total of Vaccinations Table

On a new location (Algeria), it starts the rolling total again.


Percentage Running Total of Vaccinations Table

Explanation of Query:


  • I used a window function to create a running total by partitioning the data by location and ordering by date.

  • The SUM window function computes the running total of new_vaccinations, and the date conversion logic ensures the sum calculation is ordered correctly by date.

  • The clause ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW within the SUM window function ensures that the sum starts from the first day of new vaccinations and increments for each subsequent day within the same location. This is essential for ensuring the running total is cumulative per location.

  • UNBOUNDED PRECEDING means that the sum will start from the very first row of the partition (i.e., the first date recorded for the location's vaccinations) and will continue to accumulate for each subsequent row.


7. Vaccination Percentage (Cumulative):

Finally, I calculated the percentage of the population vaccinated cumulatively using a similar approach:


SELECT 
    d.continent,
    d.location,
    d.date,
    d.population,
    v.new_vaccinations,
    SUM(CAST(v.new_vaccinations AS INT)) 
        OVER (PARTITION BY d.location ORDER BY CAST(substr(d.date, 7, 4) || '-' || substr(d.date, 1, 2) || '-' || substr(d.date, 4, 2) AS DATE)
              ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_vaccinations,
    (SUM(CAST(v.new_vaccinations AS INT)) 
        OVER (PARTITION BY d.location ORDER BY CAST(substr(d.date, 7, 4) || '-' || substr(d.date, 1, 2) || '-' || substr(d.date, 4, 2) AS DATE)
              ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) / d.population) * 100 AS Percent_Running_Vaccinated
FROM CovidDeaths AS d
JOIN CovidVaccinations AS v
    ON d.date = v.date
    AND d.location = v.location
WHERE d.continent IS NOT NULL
ORDER BY d.location, CAST(substr(d.date, 7, 4) || '-' || substr(d.date, 1, 2) || '-' || substr(d.date, 4, 2) AS DATE) ASC;

Result:

  • The percentage of the population vaccinated is shown cumulatively by location, providing insights into vaccination trends.


Percentage Running Total of Vaccinations Table
Percentage Running Total of Vaccinations Table


Conclusion:

This project not only enhanced my SQL skills but also provided valuable insights into the global impact of COVID-19 and vaccination progress. It showcases my ability to handle large datasets, manipulate data, and create meaningful visualizations and analyses.


Tables Downloaded:

  • Percentage Populated Infected Table

  • Percentage Populated Infected2 Table (CLV)

  • HighestDeathsPerContinent Table

  • TotalNewDeathPercentage Table


NEXT: Power Bi visualizations

Comments


bottom of page