This post relies on three Python libraries: pandas, ddlgenerator and psycopg2. These three enable us to clean a dataset and push it to a PostgreSQL database, where the data can later be queried and exposed to a wide variety of company figures.

Let's create a folder named happiness and a subfolder named data for storing the files:

mkdir -p happiness/data
cd happiness
Copy the downloaded dataset files into the happiness/data folder.

2015.csv and 2016.csv share the same column names and data format. In 2017.csv the Region column is missing, and some columns are named and ordered differently (Generosity and Trust are swapped). The same applies to 2018.csv and 2019.csv, where the confidence intervals are gone as well. Moreover, the year is only contained in the file name, not in a column.

Let's create a Python virtual environment and install pandas:

python -m venv python_pg
source python_pg/bin/activate
pip install pandas
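Before writing any cleaning code, it can help to confirm how the column names differ across the years. A minimal, hypothetical check (assuming the five CSV files are already in data/):

import glob
import pandas as pd

# Print each file's header row to compare column names across years
for fp in sorted(glob.glob('data/*.csv')):
    print(fp, list(pd.read_csv(fp, nrows=0).columns))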
Create a file named prepare.py in the happiness folder and start adding some Python code to it. We'll begin by importing the library installed earlier, together with os and glob, which are part of the standard library and enable interaction with the operating system and pathname pattern expansion respectively:

import pandas as pd
import glob, os
Next, let's define the list of column names we'll use across all years, and an empty DataFrame all_years_df to store the dataset once cleaned. Add the following lines to the prepare.py file:

column_names = ['survey_yr', 'country', 'overall_rank',
                'score', 'gdp', 'generosity', 'freedom',
                'social_support', 'life_exp', 'gov_trust']
all_years_df = pd.DataFrame()
Now it's time to parse the files in the data folder. Add the following code to the prepare.py file:

files = glob.glob('data/*.csv')
for fp in files:
    # Handling file names as new column
    file_year = os.path.basename(fp).split('.')[0]
    file_content = pd.read_csv(fp).assign(Year=int(file_year)-2000)
    # Picking the right columns for each file based on the year
    uniformed_columns = find_columns(file_content, file_year)
    # Conforming column names and appending the parsed year data to all_years_df
    uniformed_columns.columns = column_names
    all_years_df = pd.concat([all_years_df, uniformed_columns])
all_years_df.to_csv('results.csv', index=False)
The glob function returns the list of files in the happiness/data folder, which we loop over. For each file, we read the file name with os.path.basename, split it on the dot (.) and take the first part, extracting only the year information. The file_content line uses pandas' read_csv function to load the file data and populates the new column Year (using the assign function) with the last two digits of the year (int(file_year)-2000).

Next, the code calls the find_columns function, which we'll define soon, to select the correct columns from the source file depending on the year. It then conforms the column names, producing a uniformed_columns DataFrame containing only one year of data, and appends it to the all_years_df DataFrame that will contain the whole dataset at the end of the process. Lastly, we store the end result in a CSV file named results.csv in the happiness folder.
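To make the year handling concrete, here's a tiny hypothetical example of what the two steps produce:

import os

name = os.path.basename('data/2019.csv').split('.')[0]
print(name)              # prints '2019'
print(int(name) - 2000)  # prints 19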
Now for the find_columns function definition, where we consolidate the column names and order. As mentioned above, depending on the year, we need to select the correct columns from the source file. Add the following code to the prepare.py file, just after the import statements, to perform the trick for the various years:

def find_columns(file_content, file_year):
    if file_year in ['2015', '2016']:
        uniformed_columns = file_content[[
            'Year', 'Country', 'Happiness Rank', 'Happiness Score',
            'Economy (GDP per Capita)', 'Generosity', 'Freedom', 'Family',
            'Health (Life Expectancy)', 'Trust (Government Corruption)'
        ]]
    elif file_year in ['2017']:
        uniformed_columns = file_content[[
            'Year', 'Country', 'Happiness.Rank', 'Happiness.Score',
            'Economy..GDP.per.Capita.', 'Generosity', 'Freedom', 'Family',
            'Health..Life.Expectancy.', 'Trust..Government.Corruption.'
        ]]
    else:
        uniformed_columns = file_content[[
            'Year', 'Country or region', 'Overall rank', 'Score',
            'GDP per capita', 'Generosity', 'Freedom to make life choices',
            'Social support', 'Healthy life expectancy',
            'Perceptions of corruption'
        ]]
    return uniformed_columns
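A quick, hypothetical smoke test of the function on a single year (run from the happiness folder, assuming 2017.csv is in data/ and the code above is loaded):

df_2017 = pd.read_csv('data/2017.csv').assign(Year=17)
print(find_columns(df_2017, '2017').head())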
From the happiness folder, let's execute the following:

python prepare.py

Check the newly created results.csv in the happiness folder to verify that the code worked.
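The first rows of results.csv should look similar to the following (a hypothetical excerpt based on the 2019 values shown later in this post; the row order depends on how glob lists the files):

survey_yr,country,overall_rank,score,gdp,generosity,freedom,social_support,life_exp,gov_trust
19,Finland,1,7.769,1.34,0.153,0.596,1.587,0.986,0.393
19,Denmark,2,7.6,1.383,0.252,0.592,1.573,0.996,0.41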
Now let's create the PostgreSQL service that will host the data, using the Aiven CLI:

avn service create demo-pg \
  -t pg \
  --cloud google-europe-west3 \
  -p hobbyist
The command creates a PostgreSQL service named demo-pg on google-europe-west3, using a hobbyist plan. You can review all PostgreSQL plans on our pricing page. To connect to it, we'll need the service uri, containing information like the host, port, and the default avnadmin user's password. We can retrieve it with:

avn service get demo-pg --format '{service_uri}'

Since service creation can take a few minutes, we can wait for the service to be ready with:

avn service wait demo-pg

In the meantime, let's install the two remaining libraries:

pip install psycopg2 ddlgenerator
Now create a push.py file and add the import and database connection statements. In the following code, replace the <SERVICE_URI_OUTPUT> parameter with the output of the avn service get command above:

import psycopg2
conn = psycopg2.connect('<SERVICE_URI_OUTPUT>')
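Optionally, we can run a quick, hypothetical sanity check to confirm the connection works before going further:

cur = conn.cursor()
cur.execute('SELECT version()')  # any trivial query will do
print(cur.fetchone())
cur.close()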
ddlgenerator provides a nice piece of functionality: it creates both the table DDL and the insert statements starting from a CSV file. We can append the following code to the push.py file:

from ddlgenerator.ddlgenerator import Table
table = Table('results.csv', table_name='HAPPINESS')
sql = table.sql('postgresql', inserts=True)
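To inspect what was generated, a simple print (not part of the final push.py) does the job:

print(sql)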
After importing the Table class from ddlgenerator, the code creates an instance parsing the results.csv file, with a target table name of HAPPINESS. The last line creates the CREATE TABLE definition SQL for PostgreSQL, including the insert statements for each line (inserts=True). If we print the SQL statement, it should look like the following:

DROP TABLE IF EXISTS happiness;
CREATE TABLE happiness (
    survey_yr INTEGER NOT NULL,
    country VARCHAR(24) NOT NULL,
    overall_rank INTEGER NOT NULL,
    score DECIMAL(15, 14) NOT NULL,
    gdp DECIMAL(17, 16) NOT NULL,
    generosity DECIMAL(16, 16) NOT NULL,
    freedom DECIMAL(16, 16) NOT NULL,
    social_support DECIMAL(16, 15) NOT NULL,
    life_exp DECIMAL(17, 16) NOT NULL,
    gov_trust DECIMAL(16, 16)
);
INSERT INTO happiness (survey_yr, country, overall_rank, score, gdp, generosity, freedom, social_support, life_exp, gov_trust) VALUES (19, 'Finland', 1, 7.769, 1.34, 0.153, 0.596, 1.587, 0.986, 0.393);
INSERT INTO happiness (survey_yr, country, overall_rank, score, gdp, generosity, freedom, social_support, life_exp, gov_trust) VALUES (19, 'Denmark', 2, 7.6, 1.383, 0.252, 0.592, 1.573, 0.996, 0.41);
...
Finally, let's execute the SQL statement against the database and close the connection, appending the following code to the push.py file:

cur = conn.cursor()
cur.execute(sql)
conn.commit()
cur.close()
conn.close()
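As a hypothetical alternative for this last step, psycopg2 connections and cursors can also be used as context managers, which commits the transaction automatically on success:

with conn:
    with conn.cursor() as cur:
        cur.execute(sql)
conn.close()  # the with block commits but doesn't close the connection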
Again from the happiness folder, run the following:

python push.py
The data is now stored in PostgreSQL, in the happiness table. To verify it, create a new Python file query.py with the following content (as before, amend the connection service uri):

import psycopg2
conn = psycopg2.connect('<SERVICE_URI_OUTPUT>')
cur = conn.cursor()
cur.execute('''
    SELECT SURVEY_YR, COUNTRY, OVERALL_RANK
    FROM HAPPINESS
    WHERE OVERALL_RANK <= 3
    ORDER BY SURVEY_YR, OVERALL_RANK
''')
print(cur.fetchall())
cur.close()
conn.close()
As in the push.py file, we create a connection to PostgreSQL, then execute a query to check which countries were in the top three positions over the various years. The result should be similar to:

[
(15, 'Switzerland', 1), (15, 'Iceland', 2), (15, 'Denmark', 3),
(16, 'Denmark', 1), (16, 'Switzerland', 2), (16, 'Iceland', 3),
(17, 'Norway', 1), (17, 'Denmark', 2), (17, 'Iceland', 3),
(18, 'Finland', 1), (18, 'Norway', 2), (18, 'Denmark', 3),
(19, 'Finland', 1), (19, 'Denmark', 2), (19, 'Norway', 3)
]
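As a side note, if the rank threshold needs to change at runtime, psycopg2's parameter binding keeps the query safe from SQL injection; a hypothetical variation on the snippet above:

cur = conn.cursor()
cur.execute('''
    SELECT SURVEY_YR, COUNTRY, OVERALL_RANK
    FROM HAPPINESS
    WHERE OVERALL_RANK <= %s
    ORDER BY SURVEY_YR, OVERALL_RANK
    ''', (3,))
print(cur.fetchall())
cur.close()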
Looking at the results, we can immediately spot that Finland has been on top in the last two years.

The pandas, ddlgenerator and psycopg2 libraries mentioned in this blog make it easy to handle a dataset and push it to PostgreSQL. This makes for a solid base for data discovery or corporate reporting.