This section is divided into three sub-sections that describe how the data needed for the rest of the procedure were obtained. The first details the steps taken to obtain the company data and to select the companies used in the rest of the procedure. The second briefly explains the computed indicators that are used as input variables, together with the historical returns of the companies selected in the first sub-section. The third describes how the input and output vectors were created from the data resulting from the second sub-section.

2.4.1 Data Collection

A more detailed explanation regarding the code used to carry out the procedure described in this sub-heading can be found in Annex. 4 - Data Collection.

To illustrate how artificial neural networks and quadratic programming can be used in a portfolio management strategy, this paper uses data from the Spanish market. The companies considered are those in the list of listed companies published in “Empresas Cotizadas” (n.d.), which is shown in Table 2.

Table 2 lists r nrow(empresas) companies. For each company it gives the name, ticker, sector and subsector, market and index, and whether or not the company was selected for the rest of the procedure after the steps set out in this sub-section.

The company data were obtained from (n.d.a) and then analysed to select the companies used in the rest of the procedure. The process followed to obtain and select the data is explained next.

Monthly data were downloaded for each of the companies listed in Table 2, covering the period from January 31, 2000 to February 28, 2023 for each entity.

After obtaining the data, the quality of the data was evaluated. The evaluation began with a visual exploratory analysis of adjusted prices since, as explained in the previous chapter, these are the ideal ones to use in any historical analysis methodology.

The visual exploratory analysis revealed irregularities in the adjusted prices of some of the series. These consisted of incorrectly recorded adjusted prices and of errors in their calculation. Both were easy to detect in the graphs. In some series the adjusted closing price remained constant over long periods of time, as seen in Figure 6, which indicates that price changes were not recorded. In others the adjusted price changed suddenly by more than 100% in a single period, as seen in Figure 7, which may indicate a miscalculation of the adjusted price. In this last case the prices were checked against other sources, such as (n.d.b), to confirm that they were indeed miscalculated.
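The second kind of irregularity can also be screened for programmatically. The following is a minimal sketch, not the code used in this study: the function name `flag_price_jumps`, the 100% threshold expressed as `1.0`, and the sample prices are illustrative assumptions.

```python
def flag_price_jumps(prices, threshold=1.0):
    """Return the indices i where the one-period change in price
    exceeds `threshold` in absolute value (1.0 means 100%)."""
    flagged = []
    for i in range(1, len(prices)):
        previous = prices[i - 1]
        if previous == 0:
            continue  # the relative change is undefined, so skip it
        change = (prices[i] - previous) / previous
        if abs(change) > threshold:
            flagged.append(i)
    return flagged


# Hypothetical adjusted closing prices with one suspicious jump.
adjusted_close = [10.0, 10.4, 22.5, 22.1, 21.8]
suspect_periods = flag_price_jumps(adjusted_close)
```

A flagged period is only a candidate for a miscalculated adjusted price and would still need to be checked against a second source.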

Given the time available for the study and the extensive investigation that replacing the erroneous values in the series would require, it was decided to eliminate these irregularities by using only the values after January 2005, which no longer presented inconsistencies in the calculation of the adjusted price. Subsequently, the series that still contained missing values or that presented irregularities in the recording of price variations were eliminated. For the latter, a series was eliminated when price variations went unrecorded in more than 10 observations.

After these adjustments, r length(returns_emps3) companies remained, as seen in the selected column of Table 2. Some of these companies have different numbers of observations, because not all of them existed or had been listed on the stock market before January 2005.

Once the companies were selected, their returns were computed from the adjusted prices. In addition to the returns of the selected companies, the returns of the adjusted closing price of the IBEX 35 were used, together with other variables that serve as indicators of the behaviour of the returns and of their relationship with those of the index. These variables include the volatilities of the companies and of the index, the correlation between each series and the IBEX 35, and the beta of each company with respect to the IBEX 35.
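The two elimination rules can be sketched as follows. This is an illustration under stated assumptions rather than the procedure's actual code: missing values are represented as `None`, an "unrecorded variation" is taken to be an observation whose price is exactly equal to the previous one, and the tickers and prices are hypothetical.

```python
def count_unrecorded_variations(prices):
    """Count observations whose price equals the previous one exactly
    (assumed here to signal an unrecorded price variation)."""
    return sum(1 for prev, cur in zip(prices, prices[1:]) if cur == prev)


def keep_series(prices, max_unrecorded=10):
    """Keep a series only if it has no missing values and at most
    `max_unrecorded` observations with an unrecorded variation."""
    if any(p is None for p in prices):
        return False
    return count_unrecorded_variations(prices) <= max_unrecorded


# Hypothetical series keyed by hypothetical tickers.
series = {
    "AAA": [10.0, 10.2, 10.1, 10.5],  # clean
    "BBB": [5.0] * 13,                # 12 unchanged observations
    "CCC": [7.0, None, 7.3, 7.1],     # contains a missing value
}
selected = [ticker for ticker, prices in series.items() if keep_series(prices)]
```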

2.4.2 Indicators

This sub-heading presents a brief explanation of the computed variables to be used as input variables in conjunction with the historical values of the companies’ returns. A more detailed explanation regarding the code used to carry out the procedure described in this sub-heading can be found in Annex. 4 - Indicators.

2.4.2.1 Volatility

Building on Hargrave (2023) and Hayes (2023), standard deviation and volatility are two related concepts that measure how much the price of a stock or other asset fluctuates over time. Standard deviation is a statistical term that quantifies the spread of a set of data points around its mean value. Volatility is a financial term that describes the degree of variation in the returns of an asset over a given period of time.

Standard deviation and volatility are important in stock market analysis because they indicate the risk and uncertainty associated with investing in a particular asset. A high standard deviation or volatility means that the price of the asset can change significantly in either direction, implying greater potential for profit or loss. A low standard deviation or volatility means that the asset’s price is relatively stable and predictable, implying less potential for profit or loss (Hayes, 2023).

To calculate the volatility of a stock or index, the standard deviation of its returns is calculated. The necessary calculations are shown below, ending in Equation 1.

\[R_i = \frac{P_i - P_{i-1}}{P_{i-1}}\]

\[\sigma = \sqrt{\frac{\sum_{i=1}^N (R_i - \bar{R})^2}{N} } \tag{1}\]

where:

  • \(R_i\) is the return of the stock in the period \(i\).

  • \(P_i\) and \(P_{i-1}\) are the prices of a share in time periods \(i\) and \(i-1\), respectively.

  • \(\sigma\) is the standard deviation.

  • \(N\) is the number of observations.

  • \(\bar{R}\) is the average return on the stock.
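The two calculations above can be sketched in a few lines. This is a minimal illustration, not the code of Annex 4: the function names and the price series are assumptions, and the standard deviation divides by \(N\), as in Equation 1.

```python
import math


def simple_returns(prices):
    """R_i = (P_i - P_{i-1}) / P_{i-1} for each pair of consecutive prices."""
    return [(cur - prev) / prev for prev, cur in zip(prices, prices[1:])]


def volatility(returns):
    """Standard deviation of the returns, dividing by N as in Equation 1."""
    n = len(returns)
    mean = sum(returns) / n
    return math.sqrt(sum((r - mean) ** 2 for r in returns) / n)


# Hypothetical monthly adjusted prices.
prices = [100.0, 110.0, 99.0, 108.9]
returns = simple_returns(prices)  # roughly +10%, -10%, +10%
sigma = volatility(returns)
```

Dividing by \(N - 1\) instead would give the sample standard deviation; the choice matters little for long series but should be kept consistent across all indicators.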

Standard deviation and volatility are useful tools for investors and analysts to assess the risk-reward balance of different assets and portfolios. They can also help compare the performance of different assets and portfolios over time and under different market conditions.

2.4.2.2 Correlation

As Edwards (2022) explains, correlation is a statistical measure that determines how two variables move relative to each other. In stock market analysis, correlation can help understand the behaviour of different stocks or market indicators over time. Taking the data used in this paper as an example, if the prices of one of the selected companies tend to go up and down together with the IBEX 35, these prices have a positive correlation. If, on the contrary, the company’s prices tend to rise when the IBEX 35 falls, they have a negative correlation. A correlation coefficient of zero means that there is no linear relationship between the variables, in this case the values of the IBEX 35 and the prices of the company in question.

As set out by Ross (2022), the correlation between two variables is calculated using Equation 2:

\[\rho_{xy} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}\sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}} \tag{2}\]

where:

  • \(\rho_{xy}\) is the correlation coefficient.

  • \(n\) is the number of observations.

  • \(x_i\) and \(y_i\) are the values of the two variables for observation \(i\).

  • \(\bar{x}\) and \(\bar{y}\) are the means of the two variables.
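Equation 2 translates directly into code. The sketch below is illustrative only: the function name `correlation` and the two return series are assumptions, not data from this study.

```python
import math


def correlation(x, y):
    """Pearson correlation coefficient of two equally long series (Equation 2)."""
    if len(x) != len(y):
        raise ValueError("both series must have the same number of observations")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    # A constant series has zero spread, which leaves the coefficient undefined.
    return numerator / (spread_x * spread_y)


# Hypothetical monthly returns of a company and of the index.
company_returns = [0.02, -0.01, 0.03, 0.00]
index_returns = [0.01, -0.02, 0.02, 0.01]
rho = correlation(company_returns, index_returns)
```

With these hypothetical values the coefficient is positive, meaning the two series tend to move in the same direction.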

As also explained by Edwards (2022), the correlation coefficient ranges from -1 to 1, where -1 indicates perfect negative correlation, 1 indicates perfect positive correlation, and 0 indicates no linear correlation. The closer the coefficient is to either -1 or 1, the stronger the linear relationship between the variables analysed.

As previously explained, in this paper the correlation coefficient can be used to analyse how similarly the returns of a company move compared to those of the IBEX 35. Correlation can also be used to diversify a portfolio by choosing stocks that have a low or negative correlation with each other, as explained by Boyte-White (2022). This can help reduce overall portfolio risk, as losses from one stock can be offset by gains from another. However, correlation is not constant and can change over time due to various factors, such as market conditions, economic events, or company news. Therefore, it is important to monitor stock correlations regularly and adjust the portfolio accordingly (Boyte-White, 2022).

Correlation is a valuable tool in stock market analysis, but it does not imply causation. Having a high or low correlation between two variables does not imply that one variable causes change in the other. Correlation simply measures the strength and direction of the linear relationship between two variables, without considering other factors that may influence them.

As Edwards (2022) also notes, correlation is closely related to the volatility of the market and of the shares. During periods of greater volatility, such as the financial crisis of 2008, shares can tend to be more correlated, even if they are in different sectors. International markets can also become highly correlated during times of instability. Investors may want to include assets in their portfolios that have a low correlation with equity markets to help manage their risk.

2.4.2.3 Beta

As explained by Kenton (2022), beta is a measure of how sensitive a stock’s returns are to changes in market returns. It is calculated as the slope of the regression line that fits historical stock and market returns. A beta of 1 means the stock moves in sync with the market, a beta greater than 1 means the stock is more volatile than the market, and a beta less than 1 means the stock is less volatile than the market.

Beta is important in stock market analysis because, as Kenton (2022) explains, it helps investors assess the risk and return of a portfolio. By knowing the beta of each stock in a portfolio, investors can estimate how much the portfolio will fluctuate with market movements and adjust their asset allocation accordingly. For example, an investor who wants to reduce the risk in their portfolio might choose stocks with low beta values, which move less than the market, or with negative beta values, which tend to move in the opposite direction to the market.

As explained by Monaghan (2019), beta is related to correlation, but they are not the same. As explained above, correlation measures how closely two variables follow a linear relationship. Beta, on the other hand, measures the size of that relationship, indicating how much one variable changes when the other variable changes by one unit. The beta of a stock \(x\) with respect to a market \(y\) can be calculated from the correlation using Equation 3:

\[\beta = \frac{\rho_{xy} \sigma_x}{\sigma_y} \tag{3}\]

where:

  • \(\rho_{xy}\) is the correlation coefficient between \(x\) and \(y\)

  • \(\sigma_x\) is the volatility of x

  • \(\sigma_y\) is the volatility of y
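Equation 3 and the regression-slope definition of beta give the same number, which the following sketch checks. It is an illustration only: the function names and the return series are assumptions, with \(x\) taken as the company and \(y\) as the market index.

```python
import statistics


def beta_from_correlation(stock, market):
    """Beta as correlation times the ratio of volatilities (Equation 3)."""
    n = len(stock)
    mean_s = statistics.fmean(stock)
    mean_m = statistics.fmean(market)
    covariance = sum((s - mean_s) * (m - mean_m) for s, m in zip(stock, market)) / n
    sigma_s = statistics.pstdev(stock)   # volatility of the stock
    sigma_m = statistics.pstdev(market)  # volatility of the market
    rho = covariance / (sigma_s * sigma_m)
    return rho * sigma_s / sigma_m


def beta_from_slope(stock, market):
    """Beta as the slope of the regression of stock returns on market returns."""
    mean_s = statistics.fmean(stock)
    mean_m = statistics.fmean(market)
    covariance = sum((s - mean_s) * (m - mean_m) for s, m in zip(stock, market))
    variance_m = sum((m - mean_m) ** 2 for m in market)
    return covariance / variance_m


# Hypothetical monthly returns of a company and of the index.
company_returns = [0.02, -0.01, 0.03, 0.00]
index_returns = [0.01, -0.02, 0.02, 0.01]
beta_a = beta_from_correlation(company_returns, index_returns)
beta_b = beta_from_slope(company_returns, index_returns)
```

Both functions return the same value up to floating-point error, because \(\rho_{xy}\sigma_x/\sigma_y\) simplifies to the covariance of the two series divided by the variance of the market.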

2.4.3 Vectors

In this sub-section, the procedure carried out to create the input and output vectors from the data resulting from the procedure described in the previous sub-section is explained. A more detailed explanation regarding the code used to carry out the procedure described in this sub-heading can be found in Annex. 4 - Vectors.

The structure of the set of input and output vectors is of vital importance when modelling with ML techniques, since it has a significant impact on their effectiveness. The set of vectors must be built so that it represents the problem to be solved. The steps described below therefore explain in detail the problem addressed in this work and how the set of input and output vectors was shaped to fit it.

As previously mentioned, the objective of this paper is to present a procedure for the use of ANN models and quadratic programming in an investment strategy. The modelling addresses the need to obtain the most accurate predictions possible so that the ideal portfolio composition can later be found from the predictions and the historical data. Therefore, the problem to be represented with the sets of input and output vectors is how to explain the behaviour of the return of a company at an instant of time \(i+1\) with the values of several variables at the instant of time \(i\).

To represent this problem, three-dimensional vectors were created, following Chollet and Allaire (2018). The dimensions of these vectors are explained as follows:

  • The first dimension is comprised of the number of samples obtained by sectioning the observations of the different series into consecutive two-dimensional vectors.

  • The second dimension is comprised of the number of observations, of the different series, collected in each two-dimensional vector.

  • The third dimension is the number of series in each two-dimensional vector.

Therefore, in order to obtain these samples correctly, it is first necessary to define which series will be used for the input and output vectors. The series used in the input vectors were defined in the previous sub-section: the historical returns of the company and of the IBEX, the historical volatilities of the company and of the IBEX, the historical correlation between the company and the IBEX, and the historical beta of the company with respect to the IBEX. The series used for the output vectors is the historical return of the company.

Subsequently, the time horizon to be predicted was defined, which is a key aspect in the creation of the sets of inputs and outputs. The number of observations defined as the time horizon determines the number of observations in the output vectors. In the present work the time horizon was set to one observation, since the aim is to predict the next month's return of each of the selected companies.

The last aspect to define is how many observations the model must see in order to infer the desired output. This defines the number of observations taken from each time series to form the input vectors. Determining it requires an iterative process in which different quantities are tested and the results obtained by the models trained with them are evaluated. To simplify the process, the present work tested input sizes of 1, 2 and 3 observations, which gives an indication of how the size of the inputs affects the prediction obtained.

If the array for the input vectors contains r dim(returns_indc[[1]])[1] observations, the number of samples obtained from this array can be calculated with Equation 4:
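The sectioning of the series into consecutive samples can be sketched as a sliding window. This is a minimal illustration under assumptions, not the code of Annex 4: the function name `make_windows`, the nested-list representation and the two-series example data are all hypothetical.

```python
def make_windows(features, target, input_size, output_size=1):
    """Section aligned series into consecutive samples.

    features: one row per observation, each row holding the value of every
              input series at that instant.
    target:   the output series, aligned with `features`.
    Returns (inputs, outputs), where inputs has shape
    (samples, input_size, number of series) and outputs has shape
    (samples, output_size).
    """
    n = len(features)
    samples = n - (input_size - 1 + output_size)
    inputs, outputs = [], []
    for start in range(samples):
        end = start + input_size
        inputs.append([list(row) for row in features[start:end]])
        outputs.append(list(target[end:end + output_size]))
    return inputs, outputs


# Hypothetical data: 6 observations of 2 series (company return, index return).
features = [
    [0.01, 0.02], [0.03, -0.01], [-0.02, 0.00],
    [0.04, 0.01], [0.00, 0.02], [0.01, -0.03],
]
target = [row[0] for row in features]  # the company's own return
inputs, outputs = make_windows(features, target, input_size=3)
```

Each output holds the observation that immediately follows its input window, so the model learns to predict the next month from the preceding months. With 6 observations, an input size of 3 and an output size of 1, the sketch yields 3 samples.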

\[ m = n - (i-1+o) \tag{4}\]

where:

  • \(m\) is the number of samples.

  • \(n\) is the number of observations in the series.

  • \(i\) and \(o\) are the number of observations in the input and output vectors, respectively.

Table 3 shows the number of samples obtained for the different sizes of input vectors proposed, taking into account the different numbers of observations of the selected r length(returns_indc) companies. Figure 16 shows what the input and output vectors look like when the input vector has 3 observations.
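As a worked instance of Equation 4, assume a purely hypothetical series of \(n = 200\) observations (not a figure from this study) and an output of one observation, \(o = 1\). The three input sizes tested then give:

```latex
\[
\begin{aligned}
i = 1 &: \quad m = 200 - (1 - 1 + 1) = 199 \\
i = 2 &: \quad m = 200 - (2 - 1 + 1) = 198 \\
i = 3 &: \quad m = 200 - (3 - 1 + 1) = 197
\end{aligned}
\]
```

Each additional observation in the input vector therefore costs one sample.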