Equity analysis was a take-home case I was given as part of an interview process at a fintech company.
The Excel file contains time-series data for public companies with a few relevant attributes. Some irrelevant attributes were also added to test a candidate's ability to separate the data that is essential for the task from data that is not.
Detailed commentary and steps are in the IPython notebook comments; please see the files below.
For download:
Raw Excel file
IPython notebook on GitHub
Tableau visualization workbook
Data exploration using the IPython notebook
In [1]:
import pandas as pd
In [2]:
df = pd.read_csv('equity_data.csv')
In [3]:
df.head()
Out[3]:
| | Company Name | Date | Ticker | Returns | Capitalization | Mkt Cap | FCF Yield | Momentum | Sales Growth 1Y | Sales Growth 5Y | EPS Revision |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAC Holdings, Inc. | 12/31/2014 | AAC | -17.46 | 1 | 651.82 | NaN | NaN | 75.27 | 75.27 | 1.15 |
| 1 | AAC Holdings, Inc. | 1/30/2015 | AAC | 41.03 | 1 | 537.99 | NaN | NaN | 75.27 | 75.27 | 1.01 |
| 2 | AAC Holdings, Inc. | 2/27/2015 | AAC | -15.03 | 1 | 758.71 | -1.18 | NaN | 75.27 | 75.27 | 1.09 |
| 3 | AAC Holdings, Inc. | 3/31/2015 | AAC | 13.34 | 1 | 666.06 | -1.18 | NaN | 14.88 | 41.72 | 1.07 |
| 4 | AAC Holdings, Inc. | 4/30/2015 | AAC | 12.12 | 1 | 756.97 | -1.18 | NaN | 14.88 | 41.72 | 1.07 |
In [4]:
df.describe()
Out[4]:
| | Returns | Capitalization |
|---|---|---|
| count | 377684.000000 | 379119.0 |
| mean | 0.771014 | 1.0 |
| std | 13.905732 | 0.0 |
| min | -98.980000 | 1.0 |
| 25% | -5.450000 | 1.0 |
| 50% | 0.670000 | 1.0 |
| 75% | 6.630000 | 1.0 |
| max | 890.390000 | 1.0 |
In [5]:
# Interestingly enough, only Returns and Capitalization came back. Let's check the types of the other columns
In [6]:
df.dtypes
Out[6]:
Company Name        object
Date                object
Ticker              object
Returns            float64
Capitalization       int64
Mkt Cap             object
FCF Yield           object
Momentum            object
Sales Growth 1Y     object
Sales Growth 5Y     object
EPS Revision        object
dtype: object
In [7]:
# As suspected, columns like Mkt Cap, FCF Yield, etc. are objects, not numbers; let's change that.
In [8]:
# First, let's replace the spaces in the column names with underscores
df.rename(columns = {'Company Name' : 'Company_Name',
'Mkt Cap' : 'Mkt_Cap',
'FCF Yield' : 'FCF_Yield',
'Sales Growth 1Y': 'Sales_Growth_1Y',
'Sales Growth 5Y': 'Sales_Growth_5Y',
'EPS Revision' : 'EPS_Revision'},
inplace = True)
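The explicit rename above works; a vectorized one-liner could also replace the spaces in every column label at once. A minimal sketch on a tiny synthetic frame (the real file is not loaded here):

```python
import pandas as pd

# Tiny synthetic frame standing in for the real data
df = pd.DataFrame({'Company Name': ['AAC Holdings, Inc.'],
                   'Mkt Cap': ['651.82'],
                   'FCF Yield': [None]})

# Replace every space in the column labels with an underscore
df.columns = df.columns.str.replace(' ', '_')
print(list(df.columns))  # ['Company_Name', 'Mkt_Cap', 'FCF_Yield']
```

The explicit rename dictionary is still preferable when only some columns should change.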
In [9]:
# Quality check
df.dtypes
Out[9]:
Company_Name        object
Date                object
Ticker              object
Returns            float64
Capitalization       int64
Mkt_Cap             object
FCF_Yield           object
Momentum            object
Sales_Growth_1Y     object
Sales_Growth_5Y     object
EPS_Revision        object
dtype: object
In [10]:
# We can't convert NaN to a number, so let's change NaN to zero because that will not affect our analysis.
df.isna().sum()
Out[10]:
Company_Name           0
Date                   0
Ticker                 0
Returns             1435
Capitalization         0
Mkt_Cap               57
FCF_Yield          15973
Momentum           12740
Sales_Growth_1Y     7960
Sales_Growth_5Y     5851
EPS_Revision       21802
dtype: int64
In [11]:
# change NaN to zero values
df.loc[df['Returns'].isnull(), 'Returns'] = '0'
df.loc[df['Mkt_Cap'].isnull(), 'Mkt_Cap'] = '0'
df.loc[df['FCF_Yield'].isnull(), 'FCF_Yield'] = '0'
df.loc[df['Momentum'].isnull(), 'Momentum'] = '0'
df.loc[df['Sales_Growth_1Y'].isnull(), 'Sales_Growth_1Y'] = '0'
df.loc[df['Sales_Growth_5Y'].isnull(), 'Sales_Growth_5Y'] = '0'
df.loc[df['EPS_Revision'].isnull(), 'EPS_Revision'] = '0'
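Assigning the string '0' works, but it also flips numeric columns (like Returns) back to object, forcing the re-conversion below. An alternative sketch using a numeric fill, which keeps float columns float, shown on a synthetic frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Returns': [12.12, np.nan, -15.03],
                   'FCF_Yield': [-1.18, np.nan, np.nan]})

# fillna(0) replaces NaN with a numeric zero, so float columns stay float64
df = df.fillna(0)
print(df.dtypes)
```

With this approach, Returns would still be float64 in the quality check that follows.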
In [12]:
# Quality check
df.isna().sum()
Out[12]:
Company_Name       0
Date               0
Ticker             0
Returns            0
Capitalization     0
Mkt_Cap            0
FCF_Yield          0
Momentum           0
Sales_Growth_1Y    0
Sales_Growth_5Y    0
EPS_Revision       0
dtype: int64
In [13]:
df.dtypes
Out[13]:
Company_Name       object
Date               object
Ticker             object
Returns            object
Capitalization      int64
Mkt_Cap            object
FCF_Yield          object
Momentum           object
Sales_Growth_1Y    object
Sales_Growth_5Y    object
EPS_Revision       object
dtype: object
In [14]:
# Converting object to number/float
df['Returns'] = df.Returns.astype(float)
In [15]:
# Returns column is now float64
df.dtypes
Out[15]:
Company_Name        object
Date                object
Ticker              object
Returns            float64
Capitalization       int64
Mkt_Cap             object
FCF_Yield           object
Momentum            object
Sales_Growth_1Y     object
Sales_Growth_5Y     object
EPS_Revision        object
dtype: object
In [16]:
# Next let's do Mkt_Cap
df['Mkt_Cap'] = df.Mkt_Cap.astype(float)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-16-249e607076f8> in <module>
      1 # Next let's do Mkt_Cap
----> 2 df['Mkt_Cap'] = df.Mkt_Cap.astype(float)

...

ValueError: could not convert string to float: '1,130.74'
In [17]:
# There is a ValueError: the string '1,130.74' can't be converted to a number, so let's see what the problem is
In [18]:
# After inspecting '1,130.74' it seems that the comma is problematic, so let's strip it out
In [19]:
# Converting to float, but also stripping the comma so we get a string that we can convert
df['Mkt_Cap'] = df.Mkt_Cap.str.replace(',', '').astype(float)
In [20]:
# No error is a good sign
In [21]:
# Quality check proves we have successfully converted Mkt_Cap from object/string to float/number
df.dtypes
Out[21]:
Company_Name        object
Date                object
Ticker              object
Returns            float64
Capitalization       int64
Mkt_Cap            float64
FCF_Yield           object
Momentum            object
Sales_Growth_1Y     object
Sales_Growth_5Y     object
EPS_Revision        object
dtype: object
In [22]:
df['FCF_Yield'] = df.FCF_Yield.str.replace(',', '').astype(float)
In [23]:
df['Momentum'] = df.Momentum.str.replace(',', '').astype(float)
In [24]:
df['Sales_Growth_1Y'] = df.Sales_Growth_1Y.str.replace(',', '').astype(float)
In [25]:
df['Sales_Growth_5Y'] = df.Sales_Growth_5Y.str.replace(',', '').astype(float)
In [26]:
df['EPS_Revision'] = df.EPS_Revision.str.replace(',', '').astype(float)
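The five conversions above all follow the same strip-and-convert pattern, so a loop with `pd.to_numeric` could do them in one pass. A minimal sketch on a synthetic frame (column names as in the real data, values made up):

```python
import pandas as pd

df = pd.DataFrame({'Mkt_Cap': ['1,130.74', '651.82'],
                   'FCF_Yield': ['-1.18', '0'],
                   'EPS_Revision': ['1.15', '1.01']})

# Strip thousands separators, then convert each column to float
for col in ['Mkt_Cap', 'FCF_Yield', 'EPS_Revision']:
    df[col] = pd.to_numeric(df[col].str.replace(',', ''))

print(df.dtypes)
```

In the real notebook the loop would also cover Momentum and the two Sales Growth columns.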
In [27]:
# Quality check: all the attribute columns are now numeric
df.dtypes
Out[27]:
Company_Name        object
Date                object
Ticker              object
Returns            float64
Capitalization       int64
Mkt_Cap            float64
FCF_Yield          float64
Momentum           float64
Sales_Growth_1Y    float64
Sales_Growth_5Y    float64
EPS_Revision       float64
dtype: object
In [28]:
# Now we can perform some statistical checks; let's try the describe function
df.describe()
Out[28]:
| | Returns | Capitalization | Mkt_Cap | FCF_Yield | Momentum | Sales_Growth_1Y | Sales_Growth_5Y | EPS_Revision |
|---|---|---|---|---|---|---|---|---|
| count | 379119.000000 | 379119.0 | 379119.000000 | 379119.000000 | 379119.000000 | 3.791190e+05 | 3.791190e+05 | 3.791190e+05 |
| mean | 0.768096 | 1.0 | 6231.498876 | 1.435006 | 11.024613 | 2.542597e+03 | 4.103473e+03 | -2.240330e+02 |
| std | 13.879471 | 0.0 | 23489.381721 | 75.938681 | 67.903766 | 4.414335e+05 | 4.899940e+05 | 5.258837e+05 |
| min | -98.980000 | 1.0 | 0.000000 | -10296.930000 | -99.640000 | -1.812420e+03 | -1.514150e+03 | -9.663676e+07 |
| 25% | -5.420000 | 1.0 | 384.810000 | -0.040000 | -15.710000 | -1.350000e+00 | 2.270000e+00 | 9.400000e-01 |
| 50% | 0.630000 | 1.0 | 1087.080000 | 4.020000 | 4.600000 | 6.700000e+00 | 9.070000e+00 | 1.000000e+00 |
| 75% | 6.600000 | 1.0 | 3498.320000 | 7.840000 | 28.130000 | 1.925000e+01 | 2.097000e+01 | 1.010000e+00 |
| max | 890.390000 | 1.0 | 796466.150000 | 2769.110000 | 9580.560000 | 7.846261e+07 | 7.846261e+07 | 1.127429e+08 |
In [36]:
# We can see that we're finally operating with numbers, so math and stats operations are now possible
In [37]:
# Convert string to DateTime
df['Date'] = pd.to_datetime(df.Date)
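Since the dates appear as month/day/year strings (e.g. 12/31/2014, 1/30/2015), passing an explicit format is a bit safer and faster than letting pandas guess. A minimal sketch:

```python
import pandas as pd

dates = pd.Series(['12/31/2014', '1/30/2015', '2/27/2015'])

# An explicit format avoids ambiguity between day-first and month-first dates
parsed = pd.to_datetime(dates, format='%m/%d/%Y')
print(parsed.dt.year.tolist())  # [2014, 2015, 2015]
```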
In [38]:
# Counting number of records per year
df['Date'].dt.year.value_counts()
Out[38]:
2015    36155
2014    36112
2016    35758
2013    35560
2010    35511
2012    35450
2011    35287
2009    35257
2008    35207
2007    35137
2017    20726
2006     2959
Name: Date, dtype: int64
In [39]:
# We can clearly see that the records for 2006 are incomplete
# 2017 looks the same; the data was probably gathered mid-2017
# 2006 should be excluded from the analysis because it would give false trends
# 2017 could be compared to previous years as a partial trend, but for simplicity I will exclude it as well
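Dropping the incomplete years could look like the following sketch, on a synthetic frame with the Date column already converted to datetime:

```python
import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(['12/29/2006', '3/30/2007', '6/30/2017']),
                   'Returns': [0.5, 1.2, -0.3]})

# Keep only the fully covered years 2007-2016
df = df[df['Date'].dt.year.between(2007, 2016)]
print(len(df))  # 1
```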
In [40]:
# Out of curiosity, I will check whether all companies have the same number of records
df.groupby(df.Ticker.str.strip("'"))['Date'].nunique()
Out[40]:
Ticker
@NA          2
A          128
AA         128
AABA       126
AAC         32
AACC        66
AAI.XX1     53
AAL         44
AAMC        24
AAMRQ       61
AAN         80
AAN.XX1     48
AAOI        44
AAON       128
AAP        128
AAPL       128
AAT         76
AATI        61
AAWW       128
AAXN       128
ABAT        36
ABAX       128
ABBC        58
ABBV        55
ABC        128
ABCB       128
ABCD        53
ABCO       128
ABCW        17
ABCWQ       30
          ...
ZBRA       128
ZEN         38
ZEP         91
ZEUS       128
ZFGN        32
ZG          70
ZGEN        46
ZGNX        80
ZIGO        90
ZINCQ      100
ZION       128
ZIOP        86
ZIP         21
ZIXI        98
ZLC         64
ZLTQ        64
ZN          24
ZNGA        64
ZNT         41
ZOES        38
ZOLL        64
ZOLT        87
ZQKSQ      105
ZRAN        57
ZSPH        14
ZTS         52
ZU          22
ZUMZ       128
ZYNE        10
ZZ          75
Name: Date, Length: 5495, dtype: int64
In [41]:
# An interesting finding: the number of records differs per company, which could happen for many reasons
# We can use this knowledge for refinement in deeper analysis; for now, let's just be aware of it
In [42]:
# Quick check on correlation across the data
df.corr(method ='pearson')
Out[42]:
| | Returns | Capitalization | Mkt_Cap | FCF_Yield | Momentum | Sales_Growth_1Y | Sales_Growth_5Y | EPS_Revision |
|---|---|---|---|---|---|---|---|---|
| Returns | 1.000000 | NaN | -0.001339 | -0.013090 | -0.001583 | 0.001224 | 0.001000 | 0.002010 |
| Capitalization | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Mkt_Cap | -0.001339 | NaN | 1.000000 | 0.013302 | 0.016542 | -0.001458 | -0.002078 | -0.000203 |
| FCF_Yield | -0.013090 | NaN | 0.013302 | 1.000000 | 0.024217 | -0.005492 | -0.005630 | -0.000007 |
| Momentum | -0.001583 | NaN | 0.016542 | 0.024217 | 1.000000 | 0.000167 | 0.000215 | 0.000746 |
| Sales_Growth_1Y | 0.001224 | NaN | -0.001458 | -0.005492 | 0.000167 | 1.000000 | 0.900873 | 0.000002 |
| Sales_Growth_5Y | 0.001000 | NaN | -0.002078 | -0.005630 | 0.000215 | 0.900873 | 1.000000 | 0.000004 |
| EPS_Revision | 0.002010 | NaN | -0.000203 | -0.000007 | 0.000746 | 0.000002 | 0.000004 | 1.000000 |
In [43]:
# Capitalization is constant (always 1), which is why its correlations are all NaN; we can disregard it, but I will not drop it, to keep the file original.
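Since a constant column produces only NaN correlations, it could also be excluded from the correlation call alone, without removing it from the DataFrame. A minimal sketch on a synthetic frame:

```python
import pandas as pd

df = pd.DataFrame({'Returns': [1.0, -2.0, 3.0],
                   'Capitalization': [1, 1, 1],
                   'Mkt_Cap': [500.0, 700.0, 600.0]})

# drop() returns a copy, so the original df keeps its Capitalization column
corr = df.drop(columns=['Capitalization']).corr(method='pearson')
print('Capitalization' in corr.columns)  # False
```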
In [44]:
# Everything looks good now, so let's save this work to a CSV file.
df.to_csv('Cleaned_equity_data.csv')
In [45]:
# I could visualize this in Matplotlib or Seaborn, but since the file is fairly small I will use Tableau
# END.