Analysing survey data with Python: Quick guide

Surveys are one of the best tools for gaining deep insight into a market. Conducting polls and surveys helps us collect data and increases the chance that the questions we actually care about get answered.

For example: what is the one thing customers like best about my business? Why are my customers drawn to a new entrant? Analysing the resulting data is a real challenge when it comes to surveys.

So here is a boost for you: this piece walks you through survey data analysis with Python, step by step. Don’t worry if you have never coded before or taken any Python training; we’ll make sure you can absorb this. By the end you’ll feel empowered to unlock impressive analytical skills, and with only a few lines of code!

While working in market research, a major chunk of your time is spent dealing with survey data. This data is often available as SAV (SPSS) files.

SPSS is considered great for statistical analysis of survey data because variables, variable labels, values, and value labels are integrated in one dataset.

With SPSS, categorical variables are easy to analyse. Unfortunately, SPSS is slow on larger datasets, and its macro system for automation offers few options compared to Python. Knowing how to analyse survey data with Python is therefore a valuable addition to your skill set, and you can always opt for Python training to build it.

Setup:

The first step is to install the pyreadstat module, which enables us to import SPSS files as DataFrames:

pip install pyreadstat

Reading the Data:

The next step would be to import the module into a Jupyter notebook and load the dataset.

Our DataFrame:

It is difficult to read much information out of the DataFrame alone, because we do not know what the variables and the numeric codes mean. The meta container includes all the other data, such as the labels and value labels.

With meta.column_labels we can print the variable labels.

For the column Sat_overall the matching label is “How satisfied are you overall?”.

With only a few variables, one could easily assign labels from the list by hand, but this gets confusing with hundreds of variables. It therefore makes sense to first create a dictionary, so that we can selectively look up the correct label for a column whenever necessary.
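Such a dictionary can be built by zipping the column names and column labels that pyreadstat exposes on the meta container. The example names and labels below are made up; in practice you would use `meta.column_names` and `meta.column_labels` directly.

```python
# Stand-ins for meta.column_names and meta.column_labels from pyreadstat:
column_names = ['Age', 'Sat_overall']
column_labels = ['Age group of respondent', 'How satisfied are you overall?']

# Build a {column name -> variable label} lookup:
column_to_label = dict(zip(column_names, column_labels))
print(column_to_label['Sat_overall'])  # -> How satisfied are you overall?
```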

Unweighted Data:

While preparing a report on a conducted survey, the most sought-after output is the percentage of respondents who chose a specific answer:

df['Age'].value_counts(normalize=True).sort_index()

From the output we can only read what percentage of respondents chose each category; the categories themselves are still numeric codes. However, the dictionary meta.variable_value_labels holds all the value labels.

Currently the values are sorted by the size of the proportions; it is preferable to sort them in the order of the value labels:

df['Age'].map(meta.variable_value_labels['Age']).value_counts(normalize=True).loc[meta.variable_value_labels['Age'].values()]

Now this is what we need. The result is quite similar to the output of an SPSS “Frequencies” table.

Survey data is often evaluated according to sociodemographic characteristics.
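A simple way to break an answer down by a sociodemographic characteristic is an unweighted crosstab with column percentages. The mini-dataset and label dictionaries below are hypothetical stand-ins for the survey DataFrame and for `meta.variable_value_labels`:

```python
import pandas as pd

# Hypothetical mini-dataset standing in for the survey DataFrame:
df = pd.DataFrame({'Age':         [1, 1, 2, 2, 3, 3],
                   'Sat_overall': [1, 2, 1, 1, 2, 2]})
# Stand-ins for meta.variable_value_labels['Age'] / ['Sat_overall']:
age_labels = {1: '18-29', 2: '30-49', 3: '50+'}
sat_labels = {1: 'Satisfied', 2: 'Dissatisfied'}

# Column percentages: how satisfaction is distributed within each age group.
tab = pd.crosstab(df['Sat_overall'].map(sat_labels),
                  df['Age'].map(age_labels),
                  normalize='columns') * 100
print(tab)
```

`normalize='columns'` makes each age-group column sum to 100, which is the usual presentation for survey banners.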

Weighted Data

While conducting surveys, it is commonly found that the distribution of sociodemographic characteristics in the sample does not correspond to their distribution in the customer base. For this reason, we weight our data to reflect that distribution:

import numpy as np

df['weight'] = np.nan
df.loc[(df['Age'] == 1), 'weight'] = 0.5/(67/230)
df.loc[(df['Age'] == 2), 'weight'] = 0.25/(76/230)
df.loc[(df['Age'] == 3), 'weight'] = 0.25/(87/230)
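Each weight above follows the same pattern: target share divided by sample share. This can be sketched generically, assuming the hypothetical target shares (50/25/25) and sample counts (67/76/87 of 230) used above:

```python
import pandas as pd

# Hypothetical sample: 67 + 76 + 87 = 230 respondents across three age groups.
df = pd.DataFrame({'Age': [1] * 67 + [2] * 76 + [3] * 87})

# Assumed target shares in the customer base (not from the sample):
target_share = {1: 0.50, 2: 0.25, 3: 0.25}
sample_share = df['Age'].value_counts(normalize=True)

# weight = target share / sample share, per respondent:
df['weight'] = df['Age'].map(target_share) / df['Age'].map(sample_share)

# Weighted group shares now reproduce the target distribution:
print(df.groupby('Age')['weight'].sum() / df['weight'].sum())
```

Because each group's weights sum to its target share of the total, the weighted sample mirrors the customer base by construction.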

But how can we now take this weight into account while making calculations? For the frequency distribution, we will write a small helper function:

def weighted_frequency(x, y):
    # Weighted share per category: sum of weights in each group / total weight
    a = pd.Series(df[[x, y]].groupby(x).sum()[y]) / df[y].sum()
    # Replace the numeric codes in the index with their value labels
    b = a.index.map(meta.variable_value_labels[x])
    c = a.values
    df_temp = pd.DataFrame({'Labels': b, 'Frequency': c})
    return df_temp

After this, in the result we get a DataFrame with the respective labels and the corresponding percentage frequency:

weighted_frequency('Age', 'weight')

The weighted distribution now corresponds to the customer structure. We can see that we would have misjudged our customer base if we had not weighted our data. Weights can just as easily be integrated into crosstabs:

pd.crosstab(df['Sat_overall'].map(meta.variable_value_labels['Sat_overall']),
            df['Age'].map(meta.variable_value_labels['Age']),
            values=df.weight, aggfunc='sum', dropna=True,
            normalize='columns') \
    .loc[meta.variable_value_labels['Sat_overall'].values()] \
    .loc[:, meta.variable_value_labels['Age'].values()] * 100

All you have to do is pass the weight as the values parameter (e.g. values=df.weight) and set aggfunc='sum'.

Conclusion

In the beginning, we installed pyreadstat, a module that lets us read SAV files into Python and process them. We then looked at how labels and value labels can be assigned and how the analysis can be presented in a clear, easy-to-interpret way.

Python handles categorical data very well and is easy to use; a little practice will make you comfortable with it. A Python data science course can add confidence along with skills.

You can learn Python for data science at PST Analytics, an instructor-led, live online analytics training and certification institute based in Delhi and Gurgaon.

It offers professional analytics certification courses to beginners, advanced programmers and experts who want to improve their applied analytics knowledge. The data-driven certification courses welcome programmers and offer in-depth study programs at every level of difficulty.

Following passion & acquiring skillset demands one first step, so when are you taking yours?
