I am currently working on a data science project. The Idea is to clean the data from “glassdoor_jobs.csv”, and present it in a much more understandable manner.

import pandas as pd df = pd.read_csv('glassdoor_jobs.csv') #salary parsing #Removing "-1" Ratings #Clean up "Founded" #state field #Parse out job description df['hourly'] = df['Salary Estimate'].apply(lambda x: 1 if 'per hour' in x.lower() else 0) df['employer_provided'] = df['Salary Estimate'].apply(lambda x: 1 if 'employer provided salary' in x.lower() else 0) df = df[df['Salary Estimate'] != '-1'] Salary = df['Salary Estimate'].apply(lambda x: x.split('(')[0]) minus_Kd = Salary.apply(lambda x: x.replace('K', '').replace('$','')) minus_hr = minus_Kd.apply(lambda x: x.lower().replace('per hour', '').replace('employer provided salary:', '')) df['min_salary'] = minus_hr.apply(lambda x: int(x.split('-')[0])) df['max_salary'] = minus_hr.apply(lambda x: int(x.split('-')[1]))

I am getting the error at that last line. After digging a bit, I found out in minus_hr, some of the ‘Salary Estimate’ only has one number instead of range:

index | Salary Estimate |
---|---|

0 | 150 |

1 | 58 |

2 | 130 |

3 | 125-150 |

4 | 110-140 |

5 | 200 |

6 | 67- 77 |

And so on. Now I’m trying to figure out how to work around the “list index out of range”, and make max_salary the same as the min_salary for the cells with only one value.

I am also trying to get average between the min and max salary, and if the cell only has a single value, make that value the average

So in the end, something like index 0 would look like:

index | min | max | average |
---|---|---|---|

0 | 150 | 150 | 150 |

## Answer

You’ll have to add in a conditional statement somewhere.

df['min_salary'] = minus_hr.apply(lambda x: int(x.split('-')[0]) if '-' in x else x)

The above might do it, or you can define a function.

def max_salary(cell_value): if '-' in cell_value: max_salary = split(cell_value, '-')[1] else: max_salary = cell_value return max_salary df['max_salary'] = minus_hr.apply(lambda x: max_salary(x)) def avg_salary(cell_value): if '-' in cell_value: salaries = split(cell_value,'-') avg = sum(salaries)/len(salaries) else: avg = cell_value return avg df['avg_salary'] = minus_hr.apply(lambda x: avg_salary(x))

Swap in min_salary and repeat