Data life cycle. Data collection and systematization

Data are the means of representing, storing, and performing elementary processing operations on information. Data form the basis of information, and the term "data" itself is relatively new. Typically, data serve as the input to an information process.

Data: information needed to draw conclusions and make decisions.

Like substance or energy, data can be collected, processed, stored, and converted from one form of presentation to another; they can be created, destroyed, and reused. The defining feature of data today is their sheer volume. With the mass adoption of computers, the number of data sources has grown enormously. Consider, for example, the amount of data on the World Wide Web, which increases every minute.

The key concept in data manipulation is the "file" structure: a set of elements of the same kind (records). A file occupies a certain area on a storage medium and is characterized by a name, a type, and other attributes. A record, in turn, is a structure composed of fields, the minimal data structure.
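The file, record, and field hierarchy can be illustrated with a short sketch; the SaleRecord type, its fields, and the file name below are hypothetical and chosen only for illustration.

```python
import csv
from dataclasses import dataclass, fields

# A record: the minimal structure, composed of fields.
@dataclass
class SaleRecord:
    date: str      # field
    product: str   # field
    amount: float  # field

# A file: a named set of records of the same kind on a storage medium.
records = [
    SaleRecord("2023-01-10", "kettle", 1490.0),
    SaleRecord("2023-01-11", "toaster", 990.0),
]

with open("sales.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([fld.name for fld in fields(SaleRecord)])  # header row: field names
    for r in records:
        writer.writerow([r.date, r.product, r.amount])
```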

The main stages of the data life cycle are generation, storage, use, and destruction. Destruction is of little interest from the life-cycle point of view, since the reason for deletion is the loss of informational value. The data-usage stage comprises three phases:

• search;

• processing;

• analysis.

The result of using the data is information.

There are several methods of collecting data required for analysis:

1. Accounting systems. As a rule, accounting systems have mechanisms for reporting and data export, so obtaining the necessary information is a relatively simple operation.
2. Indirect data. Some factors can be estimated from indirect indicators. For example, the real financial situation of the residents of a given region can be estimated as follows. In most cases, goods serving the same purpose (but at different prices) are divided into groups: goods for buyers with low, medium, and high incomes. If we analyze a sales report for the region in terms of the proportional distribution of sales across these income categories, we can assume that the larger the share of expensive products within a product group, the higher the average purchasing power of the region's residents (see the sketch after this list).
3. Open sources. A large amount of data is available in open sources, such as published statistics, corporate reports, the published results of marketing research, and the like.
4. Independent marketing research and similar data-collection activities. This can be quite expensive, but it remains a viable option.
5. Internal data. Information such as expert assessments made by the organization's employees is entered into the database. This is a labor-intensive method.
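As a rough sketch of the indirect-data approach from item 2, the share of expensive goods within a product group can serve as a proxy for the average purchasing power of a region's residents; the categories and figures below are invented for illustration.

```python
# Hypothetical sales report for one region, grouped by price category
# within a single product group (e.g., kettles).
sales_by_category = {
    "low_income": 120_000.0,     # revenue from cheap models
    "medium_income": 80_000.0,   # revenue from mid-range models
    "high_income": 50_000.0,     # revenue from expensive models
}

total = sum(sales_by_category.values())
shares = {cat: amount / total for cat, amount in sales_by_category.items()}

# The larger the share of expensive goods, the higher the assumed
# average purchasing power of the region's residents.
print(f"Share of expensive goods: {shares['high_income']:.1%}")
```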

The collected data are converted to a single format, such as Excel spreadsheets, text files, or tables in a database. An important step is deciding how to represent the data. As a rule, one of the following types is chosen: number, string, date, or logical variable (yes/no). For some data the representation (formalization) is easy to determine: sales volume in rubles, for example, is simply a number. Often, however, the representation of a factor is not obvious. Such problems arise most frequently with qualitative characteristics. It is known, for instance, that sales volumes are affected by product quality (say, in the sale of household appliances or clothing).
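A minimal sketch of this formalization step, assuming pandas is available; the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical raw records gathered from different sources.
raw = pd.DataFrame({
    "sale_date": ["2023-01-10", "2023-01-11"],
    "product": ["kettle", "toaster"],
    "amount_rub": ["1490", "990"],
    "in_stock": ["yes", "no"],
})

# Assign each column one of the basic types: date, string, number, logical.
clean = pd.DataFrame({
    "sale_date": pd.to_datetime(raw["sale_date"]),                 # date
    "product": raw["product"].astype("string"),                    # string
    "amount_rub": pd.to_numeric(raw["amount_rub"]),                # number
    "in_stock": raw["in_stock"].map({"yes": True, "no": False}),   # yes/no
})
print(clean.dtypes)
```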

Quality is a complex concept, and if this indicator matters, a way to formalize it must be introduced. For example, quality can be measured by the number of defects per thousand units of production, or assessed by experts who assign one of several categories: excellent, good, satisfactory, or bad.
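Both ways of formalizing quality reduce to a simple transformation; the numbers and the ordinal scale below are invented for illustration.

```python
# Option 1: quality as the number of defects per thousand units.
defects = 12
units_produced = 8_000
defects_per_thousand = defects / units_produced * 1_000  # 1.5

# Option 2: expert assessment mapped to an ordinal scale.
quality_scale = {"bad": 1, "satisfactory": 2, "good": 3, "excellent": 4}
expert_rating = "good"
quality_score = quality_scale[expert_rating]  # 3

print(defects_per_thousand, quality_score)
```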

The data should also be unified: the same entity must be described in the same way everywhere. Knowledge discovery often focuses on analysis mechanisms while neglecting the importance of data pre-processing and cleaning, yet incorrect source data obviously lead to incorrect conclusions. Note that in most cases the source of information for analytical systems is a data warehouse that accumulates data from heterogeneous sources, which makes the problem considerably more acute.
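A small sketch of the unification step, assuming pandas; the variant spellings below stand in for the same value being recorded differently by heterogeneous sources.

```python
import pandas as pd

# The same city recorded differently by different source systems.
df = pd.DataFrame({"city": ["Moscow", "moscow ", "MOSCOW", "St. Petersburg"]})

# Unify by trimming whitespace, normalizing case, and mapping known variants
# to a single canonical spelling.
canonical = {"moscow": "Moscow", "st. petersburg": "St. Petersburg"}
df["city"] = df["city"].str.strip().str.lower().map(canonical)

print(df["city"].unique())  # ['Moscow' 'St. Petersburg']
```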

To study processes of different kinds, the data must be prepared in specific ways. Let us take a closer look at two types of data: ordered and unordered. Ordered data are needed to solve forecasting problems, that is, to determine the future course of a process on the basis of available historical data. As a rule, one of the parameters is a date or time, but arbitrary ordered readings can be used as well, for example, meter readings taken at regular intervals.
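A minimal illustration of ordered data, assuming pandas; the meter readings and the naive one-step extrapolation are invented for illustration.

```python
import pandas as pd

# Ordered data: meter readings taken at regular intervals,
# indexed by the parameter that defines the order (here, a date).
readings = pd.Series(
    [105.2, 107.8, 110.1, 113.4],
    index=pd.date_range("2023-01-01", periods=4, freq="D"),
)

# The ordering is what makes forecasting possible, e.g. extrapolating
# the average daily increase one step ahead.
next_value = readings.iloc[-1] + readings.diff().mean()
print(next_value)
```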
