
US_Baby_Names-2010

Introductory Example

US Baby Names 2010

‘C:\Users\roman\OneDrive – University of Toronto\University of Toronto\MIE1624 – Winter 2022\Lecture 1 – Introduction\Python’

http://www.ssa.gov/oact/babynames/limits.html

import pandas as pd

# The file has no header row, so the column names are supplied explicitly
names2010 = pd.read_csv('yob2010.txt', names=['name', 'sex', 'births'])

name sex births
1 Sophia F 20477
4 Ava F 15300
… … … …
33833 Zymaire M 5
33834 Zyonne M 5
33835 Zyquarius M 5
33836 Zyran M 5
33837 Zzyzx M 5

33838 rows × 3 columns
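The effect of the names= argument can be tried without the data file. The following is a minimal sketch, assuming the file uses a comma-separated, headerless name,sex,births layout; the in-memory sample and the variable names sample and df are hypothetical, with counts copied from the rows shown above.

```python
import io

import pandas as pd

# Hypothetical three-row sample in the assumed headerless "name,sex,births" layout
sample = io.StringIO("Sophia,F,20477\nAva,F,15300\nZzyzx,M,5\n")

# names= labels the columns; without it the first data row would become the header
df = pd.read_csv(sample, names=["name", "sex", "births"])
print(df)
```

Because births is parsed as an integer column, it can be summed and grouped directly in the steps that follow.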

Total number of births in 2010 by sex

names2010.groupby('sex').births.sum()

F 1759010
M 1898382
Name: births, dtype: int64
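To see what the grouped sum does on a scale that can be checked by hand, here is a small sketch on a four-row frame. The frame toy and the result totals are hypothetical names; the counts are taken from rows shown elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical four-row subset, two names per sex
toy = pd.DataFrame({
    "name": ["Sophia", "Ava", "Ethan", "Aiden"],
    "sex": ["F", "F", "M", "M"],
    "births": [20477, 15300, 17866, 15403],
})

# Split the rows by sex, then add up births inside each group
totals = toy.groupby("sex").births.sum()
print(totals)
```

The result is a Series indexed by the group key, so totals["F"] and totals["M"] give the per-sex totals.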

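Add a prop column: each name's share of total births within its sex group

# add_prop is an assumed name for this helper
def add_prop(group):
    # Cast to float before dividing (under Python 2, integer division floors)
    births = group.births.astype(float)
    group['prop'] = births / births.sum()
    return group

# group_keys=False keeps the original row index
names2010 = names2010.groupby('sex', group_keys=False).apply(add_prop)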

name sex births prop
0 22731 0.012923
1 Sophia F 20477 0.011641
2 17179 0.009766
3 16860 0.009585
4 Ava F 15300 0.008698
… … … … …
33833 Zymaire M 5 0.000003
33834 Zyonne M 5 0.000003
33835 Zyquarius M 5 0.000003
33836 Zyran M 5 0.000003
33837 Zzyzx M 5 0.000003

33838 rows × 4 columns
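The same within-group proportion can also be computed without groupby(...).apply, whose handling of the grouping column has changed across pandas releases. The sketch below uses transform("sum") instead; it is an alternative technique, not the notebook's own method, and the frame toy is a hypothetical four-row subset.

```python
import pandas as pd

# Hypothetical four-row subset, two names per sex
toy = pd.DataFrame({
    "name": ["Sophia", "Ava", "Ethan", "Aiden"],
    "sex": ["F", "F", "M", "M"],
    "births": [20477, 15300, 17866, 15403],
})

# transform("sum") returns each row's group total, aligned to the original index
group_total = toy.groupby("sex")["births"].transform("sum")

# Python 3 "/" is true division, so no float cast is needed
toy["prop"] = toy["births"] / group_total
print(toy)
```

Since the result of transform has the same index as the input, the new column can be assigned directly and the row order is unchanged.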

names2010.describe()

births prop
count 33838.000000 33838.000000
mean 108.085348 0.000059
std 693.442991 0.000376
min 5.000000 0.000003
25% 7.000000 0.000004
50% 11.000000 0.000006
75% 29.000000 0.000016
max 22731.000000 0.012923

Verify that the prop column sums to 1 within each group

import numpy as np

np.allclose(names2010.groupby(['sex']).prop.sum(), 1)
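The check uses np.allclose rather than == because sums of floating-point fractions usually carry rounding error. A minimal illustration, independent of the baby-names data (the variable total is hypothetical):

```python
import numpy as np

# Add 0.1 ten times; 0.1 has no exact binary representation
total = 0.0
for _ in range(10):
    total += 0.1

print(total == 1.0)             # exact comparison fails
print(np.allclose(total, 1.0))  # comparison within a small tolerance succeeds
```

np.allclose compares within a relative and absolute tolerance, which is the appropriate test for a column of proportions that should sum to 1.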

Extract a subset of the data with the top 10 names for each sex

def get_top10(group):
    return group.sort_values(by='births', ascending=False)[:10]

grouped = names2010.groupby(['sex'])
top10 = grouped.apply(get_top10)

top10.index = np.arange(len(top10))

name sex births prop
0 22731 0.012923
1 Sophia F 20477 0.011641
2 17179 0.009766
3 16860 0.009585
4 Ava F 15300 0.008698
5 14172 0.008057
6 Abigail F 14124 0.008030
7 13070 0.007430
8 Chloe F 11656 0.006626
9 Mia F 10541 0.005993
10 21875 0.011523
11 Ethan M 17866 0.009411
12 17133 0.009025
13 Jayden M 17030 0.008971
14 16870 0.008887
15 16634 0.008762
16 16281 0.008576
17 15679 0.008259
18 Aiden M 15403 0.008114
19 15364 0.008093
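A top-N-per-group selection can also be written without a helper function, by sorting once and taking the head of each group. The following is a sketch of that alternative on a hypothetical six-row frame (toy and top2 are illustrative names), using N = 2 so the result is easy to check by eye.

```python
import pandas as pd

# Hypothetical six-row subset, three names per sex
toy = pd.DataFrame({
    "name": ["Sophia", "Ava", "Mia", "Ethan", "Aiden", "Jayden"],
    "sex": ["F", "F", "F", "M", "M", "M"],
    "births": [20477, 15300, 10541, 17866, 15403, 17030],
})

top2 = (
    toy.sort_values(["sex", "births"], ascending=[True, False])  # F before M, largest first
       .groupby("sex")
       .head(2)                  # first two rows of each group, in sorted order
       .reset_index(drop=True)   # renumber rows 0..n-1
)
print(top2)
```

The final reset_index plays the same role as assigning np.arange(len(top10)) to the index above.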

top10.describe()

births prop
count 20.000000 20.000000
mean 16312.250000 0.008918
std 3013.830748 0.001641
min 10541.000000 0.005993
25% 15018.000000 0.008084
50% 16457.500000 0.008730
75% 17144.500000 0.009455
max 22731.000000 0.012923

Aggregate births by the first letter of the name column

# extract first letter from name column
get_first_letter = lambda x: x[0]
first_letters = names2010.name.map(get_first_letter)
first_letters.name = 'first_letter'

table = names2010.pivot_table('births', index=first_letters,
                              columns=['sex'], aggfunc='sum')

sex F M
first_letter
A 309608 198870
B 64191 108460
C 96780 168356
D 47211 123298
E 118824 102513

Normalize the table

table.sum()

F 1759010
M 1898382
dtype: int64

letter_prop = table / table.sum().astype(float)
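The division works because pandas aligns the Series returned by table.sum() (indexed by F and M) with the table's column labels, so every column is divided by its own total. Here is a small sketch of the pivot and normalization on a hypothetical five-row frame; toy and first_letter are illustrative names, and fill_value=0 is an addition so that letters missing for one sex show 0 rather than NaN.

```python
import pandas as pd

# Hypothetical five-row subset
toy = pd.DataFrame({
    "name": ["Ava", "Abigail", "Chloe", "Aiden", "Ethan"],
    "sex": ["F", "F", "F", "M", "M"],
    "births": [15300, 14124, 11656, 15403, 17866],
})

# First character of each name, used as the row key of the pivot
first_letter = toy["name"].str[0].rename("first_letter")

table = toy.pivot_table(values="births", index=first_letter,
                        columns="sex", aggfunc="sum", fill_value=0)

# table.sum() is indexed by sex, so each column is divided by its own total
letter_prop = table / table.sum()
print(letter_prop)
```

After normalization each column of letter_prop sums to 1, which makes the two sexes comparable even though their total births differ.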

Plot the proportion of boys' and girls' names starting with each letter

%matplotlib inline
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 1, figsize=(10, 8))
letter_prop['M'].plot(kind='bar', rot=0, ax=axes[0], title='Male')
letter_prop['F'].plot(kind='bar', rot=0, ax=axes[1], title='Female', legend=False)
fig.tight_layout()
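The %matplotlib inline magic only works inside IPython or Jupyter. Outside a notebook, roughly the same figure can be produced as in the sketch below, which assumes the non-interactive Agg backend and renders to an in-memory buffer; the two-row letter_prop frame is hypothetical stand-in data, not the real proportions.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the normalized table
letter_prop = pd.DataFrame(
    {"F": [0.7, 0.3], "M": [0.4, 0.6]},
    index=pd.Index(["A", "B"], name="first_letter"),
)

fig, axes = plt.subplots(2, 1, figsize=(10, 8))
letter_prop["M"].plot(kind="bar", rot=0, ax=axes[0], title="Male")
letter_prop["F"].plot(kind="bar", rot=0, ax=axes[1], title="Female")
fig.tight_layout()

# Render to PNG bytes in memory; pass a file name instead to write to disk
buf = io.BytesIO()
fig.savefig(buf, format="png")
```

Passing a path such as a .png file name to fig.savefig instead of the buffer writes the figure to disk.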