A DL model to predict the emotion behind a spoken sentence (Sentiment Analysis!)

In this blog, I’ll share the process of building a speech emotion recognition system that predicts one of 8 emotions, such as happy, sad, angry, and disgust, from a spoken sentence.

This is a two-part series for ease of understanding. This first part covers the business and machine learning problem; the second part will cover the solution. The layout of this first blog is as follows:

1. Introduction

2. Business Problem

3. Data

4. Mapping to a Machine Learning Problem

1. Introduction

There are numerous methods for emotion recognition based on facial expressions. In the last few years, however, there has been a surge in textual and speech data, and many applications are being built on such data. So, in this case study, we will focus our attention on speech and try to predict an emotion using speech alone.

2. Business Problem

2.1 Description

We will use the RAVDESS dataset (both speech and song) and the TESS dataset to solve the problem. The RAVDESS speech portion contains 1440 files: 60 trials per actor x 24 actors = 1440. The speech emotions are neutral, calm, happy, sad, angry, fearful, surprised, and disgusted.

The TESS dataset is a set of 200 target words spoken in the carrier phrase “Say the word _____” by two actresses (aged 26 and 64 years); recordings were made of the set portraying each of seven emotions (anger, disgust, fear, happiness, pleasant surprise, sadness, and neutral).

2.2 Context

Sentiment analysis on speech is instrumental in any business where we want to understand a participant’s inner emotion from their voice alone. In times like Covid-19, when most schools are shut and most counseling sessions take place over a call, an application that detects a participant’s emotion is a win-win for both sides. Applicability extends to other medical settings as well, and one of the most famous use cases is call centers, where the client’s mood is detected up front so that relevant measures can be taken.

2.3 Problem Statement:

Classify a spoken sentence into one of 8 emotions, such as sad, happy, angry, and more.

2.4 Business objectives and constraints

  • High accuracy is required
  • Low latency is required

3. Data

3.1 Data Preparation


All the data for this project was collected from Kaggle: the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) and the Toronto Emotional Speech Set (TESS).

RAVDESS contains 1440 speech files: 60 trials per actor x 24 actors = 1440. The recordings feature 24 professional actors (12 female, 12 male), vocalizing two lexically matched statements in a neutral North American accent. Eight emotions were selected for speech: neutral, calm, happy, sad, angry, fearful, surprised, and disgust.


These stimuli were modeled on the Northwestern University Auditory Test No. 6 (NU-6; Tillman & Carhart, 1966). A set of 200 target words was spoken in the carrier phrase “Say the word _____” by two actresses (aged 26 and 64 years).

In total we have 2800 files from the TESS dataset, 1012 files from the RAVDESS song, and 1440 from the RAVDESS speech dataset, summing up to 5252 files.

3.2 Data Overview and Description

Links to download RAVDESS speech and song data:

RAVDESS Emotional speech audio

Emotional speech dataset


File naming convention for RAVDESS speech Dataset:

Each of the 1440 files has a unique filename. The filename consists of a 7-part numerical identifier (e.g., 03-01-06-01-02-01-12.wav).

Filename identifiers

  • Modality (01 = full-AV, 02 = video-only, 03 = audio-only).
  • Vocal channel (01 = speech, 02 = song).
  • Emotion (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised).
  • Emotional intensity (01 = normal, 02 = strong). NOTE: There is no strong intensity for the ‘neutral’ emotion.
  • Statement (01 = “Kids are talking by the door”, 02 = “Dogs are sitting by the door”).
  • Repetition (01 = 1st repetition, 02 = 2nd repetition).
  • Actor (01 to 24. Odd-numbered actors are male, even-numbered actors are female).
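This naming convention can be turned directly into training labels by splitting the filename on its separators. A minimal sketch of such a parser; the helper name `parse_ravdess_filename` and the returned dictionary keys are my own choices, not part of the dataset:

```python
# Third identifier field -> emotion label (per the RAVDESS convention).
EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}


def parse_ravdess_filename(filename):
    """Split a RAVDESS filename such as '03-01-06-01-02-01-12.wav'
    into its seven identifier fields."""
    stem = filename.removesuffix(".wav")
    modality, channel, emotion, intensity, statement, rep, actor = stem.split("-")
    return {
        "modality": modality,
        "vocal_channel": "speech" if channel == "01" else "song",
        "emotion": EMOTIONS[emotion],
        "intensity": "normal" if intensity == "01" else "strong",
        "statement": statement,
        "repetition": rep,
        "actor": int(actor),
        # Odd-numbered actors are male, even-numbered are female.
        "actor_sex": "male" if int(actor) % 2 == 1 else "female",
    }
```

For example, `03-01-06-01-02-01-12.wav` is an audio-only speech clip of actor 12 (female) sounding fearful.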

Links to download TESS data:




4. Mapping the real-world problem to an ML problem

4.1 Type of Machine Learning Problem

There are 5252 rows of data, each labeled with an emotion class, and we have to predict the emotion of a given speech sample. So, this problem can be posed as a multi-class classification problem.

4.2 Performance Metric

The most suitable KPIs for our classification problem are:

  1. Precision
  2. Recall
  3. F1 score

Precision and Recall

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F1 Score

F1 = 2 × (Precision × Recall) / (Precision + Recall)
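To make these metrics concrete for a multi-class setting, here is a small pure-Python sketch that computes per-class Precision, Recall, and F1, plus their macro averages. In practice you would likely reach for `sklearn.metrics.classification_report`; this illustrative helper is my own:

```python
def precision_recall_f1(y_true, y_pred, labels):
    """Per-class (precision, recall, f1) and their macro averages."""
    per_class = {}
    for c in labels:
        # One-vs-rest counts for class c.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        per_class[c] = (prec, rec, f1)
    # Macro average: unweighted mean over classes, so rare emotions
    # count as much as common ones.
    macro = tuple(
        sum(v[i] for v in per_class.values()) / len(labels) for i in range(3)
    )
    return per_class, macro
```

Macro averaging matters here because the emotion classes are not perfectly balanced (e.g., neutral has no strong-intensity RAVDESS recordings).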

4.3 Machine Learning Objectives

Objective: Classify the emotion (as an integer-encoded label) for each data point (speech clip), maximizing the Precision, Recall, and F1 scores.

I hope that this simple blog clarifies the business and machine learning objectives. At Akaike, we keep investing our time, energy, and resources into learning something new that accelerates innovations. Do check out Part 2 of this blog for solutions and more.