Problemas con delimitadores dentro de archivos planos

Problemas con delimitadores dentro de archivos planos

When working with historical information from one or several old systems, which I will call Legacy, there is a great possibility that the information stored in these systems may have some kind of error.

Sometimes, the user may save information with errors, whether they are formatting errors, special characters or any value that the system should validate before saving but is allowed to pass. On the other hand, it may happen that the stored information wants to be extracted to be used as reports, but, at the moment of extracting it, it can be noticed how the existence of certain characters within the information causes reading, loading, and reporting errors.

When exporting a given range of columns from a Legacy for use, it may be the case that the number of columns in the file does not match the number of elements in the file header. For example: 

Fecha Monto Descripción Factura ID
2019/09/23 $933.02 Venta de insumos 425342353 1
2019/09/24 $375.84 Servicio de Cableado 234243 2
2019/09/26 $-583.73 Pago de: Agua, Luz, Teléfono 322525 3
2019/09/28 $739.98 , Auto consumo 235234 4

In this table, we can see how lines number 3 and 4 present the comma symbol ( ,) within their information. If this file were to be exported for reporting or auditing purposes to a .csv or .txt format, a delimiter would be used, this is known as a plain text file.

In the case where the delimiter is a comma to separate each field. The original file would read.

Date , Amount , Description , Invoice ,ID
2019/09/23 , $933.02 , Sale of supplies , 425342353 , 1
2019/09/24 , $375.84 , Wiring Service , 234243 , 2
2019/09/26 , $-583.73 , Payment of: Water, Electricity, Telephone , 322525 , 3
2019/09/28 , $739.98 , , , Self consumption , 235234 , 4

Let’s take main attention on the following line of information:
2019/09/26 , $-583.73 , Payment of: Water, Electricity, Phone , 322525 , 3

As we can see, the comma is inside the description field, this field can be typed by a user manually and the comma is present to separate elements. The problem in this scenario arises when trying to read this information.

The programs that read and use this type of file ask to define the delimiter with which they must separate this long list of characters. If it is configured with a comma, it would be read as follows: 

 

Date Amount Description Invoice ID

2019/09/23 $933.02 Sale of supplies 425342353 1

2019/09/24 $375.84 Wiring Service 234243 2

2019/09/26 $-583.73 Payment of: Water Electricity Telephone 322525 3
2019/09/28 $739.98
Auto Consumption 235234 4


The comma found within the information is read as a delimiter, causing the description column to run to the right and causing what is known as a column inconsistency, which causes readers to display errors or when loading the information, its integrity is affected.

To solve this type of problem 2 possible solutions can be applied. First, you can use a delimiter that is very unlikely to be found in the data. For example a vertical bar ( | ). The data would read:

Date | Amount | Description | Invoice |ID
2019/09/23 | $933.02 | Sale of Supplies | 425342353 | 1
2019/09/24 | $375.84 | Cabling Service | 234243 | 2
2019/09/26 | $-583.73 | Payment of: Water, Electricity, Telephone | 322525 | 3
2019/09/28 | $739.98 | ,Self Consumption | 235234 | 4

In this way we would avoid separation within the description. The second recommendation is based on enclosing each column between single quotation marks (‘ ‘) or double quotation marks.


“Date” | “Amount” | “Description”| “Invoice” | “ID”.
“2019/09/23” | “$933.02” | “Sale of supplies” | “425342353” | “1”
“2019/09/24” | “$375.84” | “Wiring Service” | “234243” | “2”
“2019/09/26” | “$-583.73” | “Payment of: Water, Electricity, Telephone” | “322525” | “3”
“2019/09/28” | “$739.98” | “,Self consumption” | “235234” | “4”

In this way the delimiters inside the quotation marks are omitted to keep consistency of the information so that it can be read correctly.

In the last line:
‘2019/09/28’ | ‘$739.98’ | ‘,Auto consumption’| ‘235234’ | ‘4’
We notice that there is a comma inside ‘,Auto consumption’. For these cases it is necessary to apply an information cleanup so that these characters, which have no reason to exist, disappear.

 

error: Content is protected.