This blog post was originally titled ChatBots… but I soon realised there was a LOT more to unpack than I first imagined. It’s definitely a topic that needs to be broken down, so let’s start with the three different types: rule-based chatbots, retrieval-based chatbots and deep-learning/generative chatbots (fig. 3). The last is something I’m aspiring to, but let’s follow the process and learn the basics first.

For reference, I am following the Build Chatbots with Python path on Codecademy.

I’ll be honest – things started to get really complicated and I had to take it super slow to get a solid understanding of the concepts. Either way, I got through this section, made some fancy notes and have started my own off-platform build. Super cool stuff!

Building an if-this-then-that style bot is actually pretty simple. Once you understand the syntax you can keep adding new rules and processes to create the system you want. Small snippets of code are one thing, but as you start to build out programs I’ve found it VERY useful to annotate with #comments, as things can get confusing, and fast!

I am working on perfecting a coffee ordering system that I can hopefully share on the blog soon. (I may have to figure out GitHub first, which is probably a whole project in itself…)

Essentially you can plan this type of chatbot as a huge decision tree, then code out all the responses. Sprout Social have a similar ‘bot-builder‘ function that I’ve used in the past. It also comes in a nice GUI (graphical user interface) to make it more accessible. In the future this could also be a nice little side project.
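As a flavour of what this looks like, here’s a minimal sketch of an if-this-then-that ordering flow (the menu and wording are placeholders I’ve made up, not my actual build):

# A tiny rule-based bot: one branch of a decision tree, coded as rules
menu = {
    "1": "espresso",
    "2": "latte",
    "3": "cappuccino",
}

print("Welcome! What can I get you?")
for key, drink in menu.items():
    print(key + ": " + drink)

choice = input("> ").strip()

# if this then that: match the input against a known rule, else fall back
if choice in menu:
    print("One " + menu[choice] + " coming right up!")
else:
    print("Sorry, I didn't catch that. Please pick 1, 2 or 3.")

Each new rule is just another branch, which is why planning the whole tree up front makes the coding so much easier.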

Text Preprocessing

Where things started to get REALLY interesting was learning how to preprocess text! What’s great about this is I can scrape content from a webpage, pull data from an API, or even take input via voice recognition; process it so it can be read by a computer; and then return a result. This is almost a mini-Alexa style bot and is super exciting.

Before I get into some of the process, here are my notes on this topic:

Essentially, breaking down any text into a format a computer can read requires a few different steps. Executing any of this requires the NLTK (Natural Language Toolkit) platform, which is explained on their site, nltk.org:

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

Together, these steps are known as Text Preprocessing and involve some of the following:

  • Noise Removal – This is exactly what it sounds like and involves removing any ‘crap’ that comes along with scraped text. This could be HTML tags, formatting markup, quotation marks etc.

As an example, after importing re (Python’s regular expressions module), we can run the following to strip opening and closing <H1> tags from a string:

re.sub(r"<.?H1>", "", variable_string)  # matches both <H1> and </H1>

The r prefix marks the pattern as a raw string – standard practice for regular expressions, so backslashes aren’t treated as escape characters.
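Putting that together, here’s a tiny self-contained sketch (the scraped string is just a made-up example):

import re

# scraped text often arrives wrapped in markup 'noise' like this:
scraped = "<H1>Latest Coffee News</H1>"

# <.?H1> matches both the opening <H1> and the closing </H1> tag
clean = re.sub(r"<.?H1>", "", scraped)

print(clean)  # Latest Coffee News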

  • Tokenization* – This is the process of breaking up a large volume of text into manageable words (word_tokenize) or sentences (sent_tokenize). These chunks are referred to as tokens.

After importing it from nltk.tokenize, you can call the function like so:

from nltk.tokenize import word_tokenize
variable = word_tokenize(variable_string)  # returns a list of word tokens

This creates a new variable holding the tokenized content of variable_string.
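Here’s the same idea in a runnable form, with sentence tokenization included too (the sample sentence is my own):

from nltk.tokenize import word_tokenize, sent_tokenize

# NLTK's tokenizers need the 'punkt' models downloaded once:
# import nltk; nltk.download('punkt')

text = "I'd love a flat white. Make it a large, please!"

print(sent_tokenize(text))
# ["I'd love a flat white.", 'Make it a large, please!']

print(word_tokenize(text))
# ['I', "'d", 'love', 'a', 'flat', 'white', '.', 'Make', 'it', 'a', 'large', ',', 'please', '!']

Note how word_tokenize splits the contraction “I’d” into two tokens and treats punctuation as tokens in their own right.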

  • Normalization* – described best by Codecademy: In natural language processing, normalization encompasses many text preprocessing tasks including stemming, lemmatization, upper or lowercasing, and stopwords removal.

Check out my notes for a deeper look into the various normalization techniques. I plan to use all of these features in my off-platform build.

Many of these normalization features let you give greater weight to the meaningful parts of the text and eliminate the ‘filler’ words. This is super useful when you need to dissect and analyse user input, and it’s also very handy for validation (there’s a quick sketch below).

*Side note – the American English spelling of these words really frustrates me.
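To make those normalization ideas concrete, here’s a minimal sketch using NLTK’s stemmer, lemmatizer and stopword list (the example sentence and printed outputs are just illustrative):

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-off downloads needed: nltk.download('punkt'),
# nltk.download('stopwords') and nltk.download('wordnet')

text = "The barista was steaming milk for the lattes"

# lowercase, tokenize, then strip out the filler (stop) words
tokens = [t for t in word_tokenize(text.lower())
          if t not in stopwords.words('english')]

stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])
# e.g. ['barista', 'steam', 'milk', 'latt']

lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t) for t in tokens])
# e.g. ['barista', 'steaming', 'milk', 'latte']

Notice the difference: the stemmer just chops off endings (giving the non-word ‘latt’), while the lemmatizer returns a proper dictionary word (‘latte’) – though it leaves ‘steaming’ untouched, because it treats every token as a noun unless you tell it otherwise.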

I’ve included the above chart as a reference for the future. Currently I’m working on a rule-based, closed-domain chatbot built on a simple dialog tree. The end goal is to build a generative, open-domain, AI-powered bot.

I’m only halfway through this bot-building course on Codecademy, so no doubt it gets into more complex stuff. Writing my notes alongside these review posts is working great in terms of understanding the concepts.

Hopefully it helps!

T3B
