Exploring COVID-19 Data with SQLite: Cases, Deaths, and Vaccination Insights
- Oluwakemi Oyefeso
- Jan 16
- 8 min read
Updated: Jan 28

In 2022, I completed this analysis as a guided project using Microsoft SQL. I decided to revisit and improve the project by implementing it in SQLite. Since I couldn't upload Excel files directly to SQLite, I first converted the data into CSV format, and then uploaded the files into SQLite for querying.

Exploring the Datasets
1. Total Deaths vs. Reported Cases:
To explore how deaths relate to reported cases by location, I calculated the percentage of deaths relative to the total cases reported per location.
SQL Query:
SELECT
location,
date,
total_cases,
total_deaths,
ROUND((total_deaths / NULLIF(total_cases, 0)) * 100, 2) AS PercentageDeaths--NULLIF prevents division by zero
FROM CovidDeaths
ORDER BY location, dateResult: The result shows the daily death percentage for each location.

For more specific insights, we can apply a WHERE clause to filter for a particular country, like Nigeria:
SELECT
location,
date,
total_cases,
total_deaths,
ROUND((total_deaths / NULLIF(total_cases, 0)) * 100, 2) AS PercentageDeaths -- NULLIF prevents division by zero
FROM CovidDeaths
WHERE location = 'Nigeria'
ORDER BY date DESC -- From the latest to the oldestResult: As of September 9, 2020, Nigeria had reported 55,632 total cases and 1,070 deaths, which is 1.92% of the total cases. Further investigation shows that Nigeria's highest death percentage was 3.31% on April 21, 2020.

For comparison, I analyzed countries like the United States, the United Kingdom, and the United Arab Emirates:
SELECT
location,
date,
total_cases,
total_deaths,
ROUND((total_deaths / NULLIF(total_cases, 0)) * 100, 2) AS PercentageDeaths -- NULLIF prevents division by zero
FROM CovidDeaths
WHERE location LIKE '%united%' -- Filters locations containing the word 'united'
ORDER BY location, date DESCResult:
The UAE has a lower fatality rate compared to the UK and US.


Fatality was higher in the UK in 2020 and begins to reduce towards the end of 2020

The US recorded a 3% death rate by September 8, 2020, which grew to 6% in May 2020.


2. Total Cases per Population:
Next, I examined how the total cases related to the population for various countries:
SELECT
location,
date,
total_cases,
population,
ROUND((total_cases / NULLIF(population, 0)) * 100, 2) AS PercentageCases -- NULLIF prevents division by zero
FROM CovidDeaths
WHERE location LIKE '%states%' -- Filters locations containing the word 'states'
ORDER BY location, date DESCResult:
Early in 2020, the reported cases in the US were below 1% of the population.

By late 2020 and 2021, the numbers began to increase significantly.

3. Highest Infection Rate vs. Population:
This query identifies locations with the highest percentage of reported cases relative to their population:
SELECT
location,
population,
MAX(total_cases) AS HighestCasesReported,
ROUND(MAX(total_cases / population) * 100, 2) AS PercentageOfPopulationInfected
FROM CovidDeaths
GROUP BY location, population
ORDER BY PercentageOfPopulationInfected DESC
Result:
Andorra had the highest case percentage at 17.13% of its population, followed by Montenegro at 15.51%. This table was downloaded for visualization in Power Bi.
For more specific insights, we can apply a WHERE clause to filter for a particular country, like Nigeria.
SELECT
location,
population,
MAX(total_cases) AS HighestCasesReported,
ROUND(MAX(total_cases / population) * 100, 2) AS PercentageOfPopulationInfected
FROM CovidDeaths
WHERE location = 'Nigeria'
GROUP BY location, population
ORDER BY PercentageOfPopulationInfected DESCThis shows that the 'HighestCases' reported in Nigeria is 165,110 which was 0.08% of the population.

For the And Top 10 Countries with the highest cases, Teh SQL Query is:
SELECT
location,
population,
MAX(total_cases) AS HighestCasesReported,
ROUND(MAX(total_cases / population) * 100, 2) AS PercentageOfPopulationInfected
FROM CovidDeaths
GROUP BY location, population
ORDER BY PercentageOfPopulationInfected DESC
Limit 10Result:
The US, and Isreal are also part of the Top 10 countries with the highest number of reported cases.

Also added 'date' to the query:
SELECT
location,
population,
date,
MAX(total_cases) AS HighestCasesReported,
ROUND(MAX(total_cases / population) * 100, 2) AS PercentageOfPopulationInfected
FROM CovidDeaths
GROUP BY location, population,date
ORDER BY PercentageOfPopulationInfected DESC
Highest Count of Total Deaths RecordedResult:
This table was downloaded as CSV file for further visualization.
4. Highest Death Counts:
To explore the locations with the highest death tolls, I used the following query:
SELECT
location,
MAX(total_deaths) AS HighestDeathCounts
FROM CovidDeaths
GROUP BY location
ORDER BY HighestDeathCounts DESCResult:
The result shows awkward numbers.

After identifying that the 'total_deaths' column was stored as a TEXT format,

I cast it to an integer to obtain correct results:
SELECT
location,
MAX(CAST(total_deaths AS int)) AS HighestDeathCounts
FROM CovidDeaths
GROUP BY location
ORDER BY HighestDeathCounts DESCResult:
The table includes results for "World",

Looking through the 'coviddeaths' table again reveals that some locations lacked continent values (i.e value = NULL), which skewed the results as shown below:
SELECT* FROM CovidDeaths
ORDER BY continent
To correct this, I added "WHERE continent IS NOT NULL" to my query to adjust for the NULL values:
SELECT
location,
MAX(CAST(total_deaths AS int)) AS HighestDeathCounts
FROM CovidDeaths
WHERE continent is not null
GROUP BY location
ORDER BY HighestDeathCounts DESCResult:
United States has the highest number of deaths.

Deaths per continent:
I reversed the query by returning values where continent IS NULL to get the count of deaths per continent:
SELECT
location,
MAX(CAST(total_deaths AS int)) AS HighestDeathCounts
FROM CovidDeaths
WHERE continent IS null
GROUP BY location
ORDER BY HighestDeathCounts DESC
To remove 'World', 'European Union', and 'International' from the table, I adjusted the condition to exclude them:
SELECT
location,
MAX(CAST(total_deaths AS int)) AS HighestDeathCounts
FROM CovidDeaths
WHERE continent IS null
AND location NOT IN ('World','European Union','International')
GROUP BY location
ORDER BY HighestDeathCounts DESCResult:
Shows only the Highest Death Counts per continent and this was downloaded as csv file for visualization in Power Bi.
Global Numbers:
To calculate the total new cases and deaths worldwide, I aggregated the data:
SELECT
date,
SUM(new_cases) AS Tot_New_Cases,
SUM(CAST(new_deaths AS INT)) AS Tot_New_Deaths,
ROUND((SUM(CAST(new_deaths AS INT)) / NULLIF(SUM(new_cases), 0)) * 100, 2) AS Percentage_New_Deaths
FROM CovidDeaths
WHERE continent IS NOT NULL
GROUP BY date
Order By dateResult:
The result is in an unordered date sequence. I needed the date to be sequential. I discovered that the date is stored as 'text' type and in MM/DD/YYYY format.

Explanation:
To handle date sorting, I used "substr" to reformat the date into a standard "YYYY-MM-DD" format and cast it as a DATE type for sequential ordering.
SELECT
date,
SUM(new_cases) AS Tot_New_Cases,
SUM(CAST(new_deaths AS INT)) AS Tot_New_Deaths,
ROUND((SUM(CAST(new_deaths AS INT)) / NULLIF(SUM(new_cases), 0)) * 100, 2) AS Percentage_New_Deaths
FROM CovidDeaths
WHERE continent IS NOT NULL AND date IS NOT NULL
GROUP BY date
ORDER BY CAST(substr(date, 7, 4) || '-' || substr(date, 1, 2) || '-' || substr(date, 4, 2) AS DATE) ASC;Result:

So, these are the daily new cases, deaths and percentage deaths across the world.
Explanation.
Since the date is in the MM/DD/YYYY format:
substr(date, 7, 4): Starts at the 7th character in the date string (including the "/ " delimiter) and extracts 4 characters starting from the 7th position which is the Year part i.e. 01/11/2020 will extract 2020.
substr(date, 1, 2): Starts at the 1st character in the date string and extracts 2 characters starting from the 1st position which is the Month part i.e. 01/11/2020 will extract 01.
substr(date, 4, 2): Starts at the 4th character in the date string and extracts 2 characters starting from the 4th position which is the Day part i.e. 01/11/2020 will extract 11
|| '-' ||: Concatenates the extracted parts into the YYYY-MM-DD format. That's why the Order By statement starts which the year, followed month and then day.
CAST(... AS DATE): Converts the resulting string into a DATE data type.
Total Deaths Worldwide Query:
SELECT
SUM(new_cases) AS Tot_New_Cases,
SUM(CAST(new_deaths AS INT)) AS Tot_New_Deaths,
ROUND((SUM(CAST(new_deaths AS INT)) / NULLIF(SUM(new_cases), 0)) * 100, 2) AS Percentage_New_Deaths
FROM CovidDeaths
WHERE location = 'World'Result:
3.18 million deaths worldwide, a 2.11% of new cases reported in the world

5. Total Population vs. Vaccinated:
Finally, I explored vaccination data by joining the CovidDeaths and CovidVaccinations tables:
SELECT
d.continent,
d.location,
d.date,
d.population,
v.new_vaccinations
FROM CovidDeaths AS d
JOIN CovidVaccinations AS v
ON d.date = v.date
AND d.location = v.location
WHERE d.continent IS NOT NULL
ORDER BY d.location, CAST(substr(d.date, 7, 4) || '-' || substr(d.date, 1, 2) || '-' || substr(d.date, 4, 2) AS DATE) ASCResult:
Vaccinations started in Albania on January 13, 2021, with 60 vaccinations recorded.

6. Running Total of Vaccinations:
I calculated the running total of vaccinations for each location, ensuring the total incremented by the daily vaccinations:
SELECT
d.continent,
d.location,
d.date,
d.population,
v.new_vaccinations,
SUM(CAST(v.new_vaccinations AS INT))
OVER (PARTITION BY d.location ORDER BY CAST(substr(d.date, 7, 4) || '-' || substr(d.date, 1, 2) || '-' || substr(d.date, 4, 2) AS DATE)
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_vaccinations
FROM CovidDeaths AS d
JOIN CovidVaccinations AS v
ON d.date = v.date
AND d.location = v.location
WHERE d.continent IS NOT NULL
ORDER BY d.location, CAST(substr(d.date, 7, 4) || '-' || substr(d.date, 1, 2) || '-' || substr(d.date, 4, 2) AS DATE) ASC;Explanation:
The running total resets when a new location appears in the dataset.
The query partitions the data by location and orders by date to calculate the cumulative sum.
Result:

On a new location (Algeria), it starts the rolling total again.

Explanation of Query:
I used a window function to create a running total by partitioning the data by location and ordering by date.
The SUM window function computes the running total of new_vaccinations, and the date conversion logic ensures the sum calculation is ordered correctly by date.
The clause ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW within the SUM window function ensures that the sum starts from the first day of new vaccinations and increments for each subsequent day within the same location. This is essential for ensuring the running total is cumulative per location.
UNBOUNDED PRECEDING means that the sum will start from the very first row of the partition (i.e., the first date recorded for the location's vaccinations) and will continue to accumulate for each subsequent row.
7. Vaccination Percentage (Cumulative):
Finally, I calculated the percentage of the population vaccinated cumulatively using a similar approach:
SELECT
d.continent,
d.location,
d.date,
d.population,
v.new_vaccinations,
SUM(CAST(v.new_vaccinations AS INT))
OVER (PARTITION BY d.location ORDER BY CAST(substr(d.date, 7, 4) || '-' || substr(d.date, 1, 2) || '-' || substr(d.date, 4, 2) AS DATE)
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_vaccinations,
(SUM(CAST(v.new_vaccinations AS INT))
OVER (PARTITION BY d.location ORDER BY CAST(substr(d.date, 7, 4) || '-' || substr(d.date, 1, 2) || '-' || substr(d.date, 4, 2) AS DATE)
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) / d.population) * 100 AS Percent_Running_Vaccinated
FROM CovidDeaths AS d
JOIN CovidVaccinations AS v
ON d.date = v.date
AND d.location = v.location
WHERE d.continent IS NOT NULL
ORDER BY d.location, CAST(substr(d.date, 7, 4) || '-' || substr(d.date, 1, 2) || '-' || substr(d.date, 4, 2) AS DATE) ASC;Result:
The percentage of the population vaccinated is shown cumulatively by location, providing insights into vaccination trends.

Conclusion:
This project not only enhanced my SQL skills but also provided valuable insights into the global impact of COVID-19 and vaccination progress. It showcases my ability to handle large datasets, manipulate data, and create meaningful visualizations and analyses.
Tables Downloaded:
Percentage Populated Infected Table
Percentage Populated Infected2 Table (CLV)
HighestDeathsPerContinent Table
TotalNewDeathPercentage Table
NEXT: Power Bi visualizations






Comments