Category: Data

REgex and applications

REgex is an abbreviation for Regular Expression. REgex is a character pattern that is used to search for a specific text or a portion of text.

The purpose of the regex is usually to search for text matches and/or make replacements on this text. When working with a large text document or an extensive log file. REgex allows you to quickly search for the pattern you need.

The first thing we need to understand is that every letter, number and symbol is a character. A string is made up of several characters to form words or keys.

In REgex different symbols are used to identify the pattern we are looking for.

For example:

 \d  allows us to search for any number from 0 to 9.
 \D will Match with any character other than a number.

One of the best tools I've come across for testing REgex codes is https://regexr.com/

Trying the example above:

In the image on the left the Regex matched the first number it found. In the middle, with 3 numbers in a row. And on the right, with the first character that was not numerical.

The tool allows to make tests in real time, selecting the text that coincides with the regex expression that we put as input.

Regex has several text identification tools. The important thing is to understand their syntax. There are several ways to search for the same character set. And it is up to each one to find the easiest way to do it.

Since in Regex the symbols on the keyboard have their own meaning, it is imperative to use an 'escape character' to be able to search for a symbol.
If our expression looks for a point (.) we need to use the following syntax \.
Otherwise the dot (.) itself is used in regex to search for 'any character'.

Another important example is inclusive and exclusive groups. This is done by using straight keys [ ].

In this example we notice that only the characters abcfjusz are selected regardless of the order in which they are found between the straight keys.

Likewise the symbol (^) to deny a data set. Taking into consideration the previous example:

Denying the above set allows us to select all characters except for the set between straight braces.

Below is a list of the syntax along with a link to a page that provides REgex exercises for practice.

Issues with delimiters within flat files

When working with historical information from one or more older systems, which I will call as Legacys, there is a good chance that the information stored in these systems has some kind of error.

Occasionally, the user can save information with formatting errors, special characters, or any value that the system should validate before saving but is allowed to pass. On the other hand, it can happen that the stored information wants to be extracted to use it as reports, but, at the moment of extracting it, it can be noticed how the existence of certain characters within the information cause reading, loading and reporting errors.

When exporting a given range of columns from Legacy for use, it may be the case that the number of columns in the file does not match the number of items in the file header. For example:

2019/09/23$933.02Sale of supplies4253423531
2019/09/24$375.84Wiring Service2342432
2019/09/26$-583.73Payment of: Water, Electricity, Telephone3225253
2019/09/28$739.98, Auto Consumption2352344

In this table we can see how line number 3 and 4 present the comma symbol ( ,) within their information. If this file is exported for reporting or auditing purposes to a .csv format or a .txt format, a delimiter would be used, this is known as a plain text file.

In the case where the delimiter is a comma to separate each field. The original file would be read.

 Date , Amount , Description, Invoice ,ID
2019/09/23 , $933.02 , Sale of supplies , 425342353 , 1
2019/09/24 , $375.84 , Cabling Service , 234243 , 2
2019/09/26 , $-583.73 , Payment of: Water, Electricity, Telephone , 322525 , 3
2019/09/28 , $739.98 , , Auto consumption , 235234 , 4

Let's take a look at the following information line:
2019/09/26 , $-583.73 , Payment of: Water, Electricity, Telephone , 322525 , 3

As we can see, the comma is inside the description field, this field can be typed by a user manually and the comma is present to separate elements. The problem within this scenario arises when you want to read this information.

The programs that read and use this type of files ask to define the delimiter with which they must separate this long list of characters. If it is configured with a comma, it would read as follows:

2019/09/23$933.02Sale of supplies4253423531
2019/09/24$375.84Wiring Service2342432
2019/09/26$-583.73Payment of: Water Electricity Phone 322525 3
2019/09/28$739.98 Auto Consumption 235234 4

The comma found within the information is read as a delimiter, causing the description column to run to the right and causing what is known as a Inconsistency of columns this causes the readers to show errors or when loading the information, its integrity is affected.

To solve this type of problems 2 possible solutions can apply. In the first instance you can use a delimiter that is very unlikely to be found in the information. For example a vertical bar ( | ). The data would be read:

Date| Amount| Description | Invoice |ID
2019/09/23 | $933.02 |  Sale of supplies  | 425342353 | 1 
2019/09/24 | $375.84 | Wiring Service | 234243 | 2 
2019/09/26 | $-583.73 | P Payment of: Water, Electricity, Telephone | 32252525 | 3 
2019/09/28 | $739.98 | | ,Auto consumption | 235234 | 4 

In this way we would avoid separation within the description. The second recommendation is based on enclosing each column in single quotes (' ') or double quotes.

"Date" | "Amount" | "Description"| "Invoice" | "ID"
"2019/09/23" | "$933.02" | "Sales of supplies" | "425342353" | "1" 
 "2019/09/24" | "$375.84" | "Cabling Service" | "234243" | "2" 
 "2019/09/26" | "$-583.73" | "Payment of: Water, Electricity, Telephone" | "322525" | "3" 
 "2019/09/28" | "$739.98" | ",Auto consumption" | "235234" | "4" 

In this way the delimiters within the quotation marks are omitted to keep the information consistent and can be read correctly.

In the last line:
2019/09/28′ | '$739.98' | ',Auto consumption'| '235234' | '4'.
We note that there is a comma within ',Auto consumption'.. For these cases it is necessary to apply a cleaning of information so that these characters that have no reason to be, disappear.