REgex is an abbreviation for Regular Expression. REgex is a character pattern that is used to search for a specific text or a portion of text.
The purpose of the regex is usually to search for text matches and/or make replacements on this text. When working with a large text document or an extensive log file. REgex allows you to quickly search for the pattern you need.
The first thing we need to understand is that every letter, number and symbol is a character. A string is made up of several characters to form words or keys.
In REgex different symbols are used to identify the pattern we are looking for.
\d allows us to search for any number from 0 to 9.
\D will Match with any character other than a number.
In the image on the left the Regex matched the first number it found. In the middle, with 3 numbers in a row. And on the right, with the first character that was not numerical.
The tool allows to make tests in real time, selecting the text that coincides with the regex expression that we put as input.
Regex has several text identification tools. The important thing is to understand their syntax. There are several ways to search for the same character set. And it is up to each one to find the easiest way to do it.
Since in Regex the symbols on the keyboard have their own meaning, it is imperative to use an 'escape character' to be able to search for a symbol. If our expression looks for a point (.) we need to use the following syntax \. Otherwise the dot (.) itself is used in regex to search for 'any character'.
Another important example is inclusive and exclusive groups. This is done by using straight keys [ ].
Likewise the symbol (^) to deny a data set. Taking into consideration the previous example:
Below is a list of the syntax along with a link to a page that provides REgex exercises for practice.
When working with historical information from one or more older systems, which I will call as Legacys, there is a good chance that the information stored in these systems has some kind of error.
Occasionally, the user can save information with formatting errors, special characters, or any value that the system should validate before saving but is allowed to pass. On the other hand, it can happen that the stored information wants to be extracted to use it as reports, but, at the moment of extracting it, it can be noticed how the existence of certain characters within the information cause reading, loading and reporting errors.
When exporting a given range of columns from Legacy for use, it may be the case that the number of columns in the file does not match the number of items in the file header. For example:
Sale of supplies
Payment of: Water, Electricity, Telephone
, Auto Consumption
In this table we can see how line number 3 and 4 present the comma symbol ( ,) within their information. If this file is exported for reporting or auditing purposes to a .csv format or a .txt format, a delimiter would be used, this is known as a plain text file.
In the case where the delimiter is a comma to separate each field. The original file would be read.
Let's take a look at the following information line: 2019/09/26 , $-583.73 , Payment of: Water, Electricity, Telephone , 322525 , 3
As we can see, the comma is inside the description field, this field can be typed by a user manually and the comma is present to separate elements. The problem within this scenario arises when you want to read this information.
The programs that read and use this type of files ask to define the delimiter with which they must separate this long list of characters. If it is configured with a comma, it would read as follows:
Sale of supplies
Payment of: Water
The comma found within the information is read as a delimiter, causing the description column to run to the right and causing what is known as a Inconsistency of columns this causes the readers to show errors or when loading the information, its integrity is affected.
To solve this type of problems 2 possible solutions can apply. In the first instance you can use a delimiter that is very unlikely to be found in the information. For example a vertical bar ( | ). The data would be read:
In this way the delimiters within the quotation marks are omitted to keep the information consistent and can be read correctly.
In the last line: ‘2019/09/28′ | '$739.98' | ',Auto consumption'| '235234' | '4'. We note that there is a comma within ',Auto consumption'.. For these cases it is necessary to apply a cleaning of information so that these characters that have no reason to be, disappear.