Renewable quantile regression for streaming data sets

Online updating is an important statistical method for analyzing big data that arrive in streams, owing to its ability to break the storage and computational barriers under certain circumstances. Quantile regression, a regression model widely used in many fields, faces challenges in model fitting and variable selection when big data arrive in streams. Chen et al. (2019, Annals of Statistics) proposed a quantile regression method for streaming data, but it requires a strong additional condition. In this paper, renewable optimized objective functions for regression parameter estimation and variable selection in quantile regression are proposed. The proposed methods use only the current data batch and summary statistics of the historical data. Theoretically, the proposed estimators are shown to have the same asymptotic distributions as the standard versions computed on the entire data stream with all data batches pooled into one data set, without any additional condition. Both simulations and real data analyses are conducted to illustrate the finite sample performance of the proposed methods.


Introduction
The concept of ''big data" may mean different things to people from different fields and has become a dominant topic in nearly all academic disciplines and applied fields. In a broad sense, big data are data on a massive scale in terms of volume, variety, velocity, variability and veracity [14]. Applying statistical models and methods to such data can cause excessive computational burden, not only in terms of strain on computer memory due to the large volume, but also in terms of computational efficiency, since even seemingly simple tasks can take an inordinate amount of time. In a recent review, [31] grouped the statistical and computational methodologies into three categories: subsampling-based approaches (e.g., [35,40,34]), divide-and-conquer approaches (e.g., [23,18,1,16,17]), and online updating approaches (e.g., [27,24,28,26]). Online updating approaches are distinct from the other two because they target problems where data arrive in streams or large chunks and address statistical problems in an updating framework without storage requirements for previous data. The demand for stream processing is increasing ([4,6], among others), which makes online updating particularly appealing: it processes huge volumes of data at speed, so that organizations and businesses can react to changing conditions in real time.
Our focus in this paper is a data set that arrives in streams. Assume we have the streaming data sets $\{D_1, \ldots, D_b\}$ up to the $b$-th batch, where $D_j$ is the $j$-th data batch with sample size $n_j$. The total sample size is then $N_b = \sum_{j=1}^{b} n_j$. In the era of big data, streaming data sets from various areas, such as bioinformatics, medical imaging and computer vision, are rapidly increasing in volume and velocity. This presents challenges for fitting efficient statistical models and drawing inferences. How to make statistical inferences without storage requirements for previous raw data is the key problem in the streaming data environment. When large amounts of data arrive in streams, online updating is an important method for alleviating both computational and data storage issues. Within this framework for regression-type analyses, [27] developed online updating algorithms for linear models and estimating equations. The estimation consistency of these methods was established under a strong regularity condition: the total number of data batches $b$ needs to satisfy $b = O(n_j^c)$ with $c < 1/3$ for all $j$, where $n_j$ is the size of the $j$-th data batch. This is a very strong restriction. For example, estimation consistency may not be guaranteed when streaming data sets arrive perpetually with $b \to \infty$.
[24] proposed renewable estimation for generalized linear models, which removes the above unnatural restriction. [22] introduced a general framework of renewable weighted sums for various online updating estimators. Among other related work, [32] expanded the scope of the online updating method by accommodating the arrival of new predictor variables midway along the data stream. [37] developed an online updating method for survival analysis under Cox proportional hazards models, and [39] proposed an online updating-based test to evaluate the proportional hazards assumption. [21] developed an updated estimator for a linear errors-in-variables model.
Quantile regression (QR) [19], which analyzes the conditional distribution of an outcome given a set of covariates and has been widely used in many fields, faces challenges in model fitting and variable selection when big data arrive in streams, since the above methods for streaming data, based on least squares and estimating equations, are no longer applicable. In this paper, we consider the following quantile regression model:
$$q_s(Y \mid X = x) = x^\top \beta_0, \qquad (1.1)$$
where $q_s(Y \mid X = x) = \inf\{y : P(Y \le y \mid X = x) \ge s\}$ is the conditional $s$-quantile, $X$ is a random vector of $p$-dimensional covariates, and $\beta_0$ is a vector of unknown parameters of interest. Note that $\beta_0$ should strictly be written as $\beta_0(s)$; we omit the argument $s$ for notational convenience.
If the data up to the $b$-th batch can be pooled into one data set, we denote $n = N_b$ and let $\{X_i, Y_i\}_{i=1}^{n}$ be independent and identically distributed (i.i.d.) samples from $(Y, X) \in \mathbb{R} \times \mathbb{R}^p$. The standard quantile regression estimator [20] solves the minimization problem
$$\hat{\beta} = \arg\min_{\beta} \frac{1}{n} \sum_{i=1}^{n} \rho_s(Y_i - X_i^\top \beta), \qquad (1.2)$$
where $\rho_s(r) = sr - rI(r < 0)$ is the check loss function and $I(\cdot)$ is the indicator function. From (1.2), the above streaming-data methods based on least squares and estimating equations are not suitable for QR: the quantile regression estimator has no closed-form expression like the least squares estimator, and the loss function of quantile regression is not differentiable, whereas the loss function needs to be twice differentiable in the estimating equation framework.
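As a quick numerical illustration of the check loss in (1.2) (our code, not the paper's), note that for an intercept-only model the minimizer of the empirical check loss is the sample $s$-quantile, which a grid search confirms:

```python
import numpy as np

def check_loss(r, s):
    """Koenker-Bassett check loss rho_s(r) = r * (s - I(r < 0))."""
    return r * (s - (r < 0))

# Intercept-only model: the empirical check-loss minimizer is the
# sample s-quantile, which we verify on a fine grid.
rng = np.random.default_rng(0)
y = rng.normal(size=1000)
s = 0.25
grid = np.linspace(-3, 3, 2001)
losses = [check_loss(y - b, s).mean() for b in grid]
b_hat = grid[int(np.argmin(losses))]
# b_hat is close to the empirical 0.25-quantile of y
```

The loss is piecewise linear in the intercept, which is exactly the nondifferentiability that motivates the smoothing in Section 2.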
To circumvent the nondifferentiability of the QR loss function, [15] proposed smoothing the indicator part of the check function via a kernel smoothing survival function. Recently, [2] applied Horowitz's smoothed quantile regression to streaming data. This is interesting; however, their method requires that the sample size of the $j$-th batch satisfy $n_j \to \infty$ with $n_j$ approximately $n_{j-2}^2$. This is a very strong restriction: for example, $n_1 = 100, n_3 = 10^4, n_5 = 10^8, n_7 = 10^{16}, n_9 = 10^{32}, \ldots$. [36] also considered QR for streaming data. However, their method requires that $b/\sqrt{N_b} \to 0$, which means the number of batches $b$ cannot be very large, and, to achieve the same asymptotic covariance matrix as the estimator based on the full data, the covariates of each batch must be homogeneous. The smoothing method proposed by [15] gains smoothness at the cost of convexity, which inevitably raises optimization issues: computing a global minimum of a nonconvex function is in general intractable. In contrast, convolution-type smoothing [11] yields a convex and twice differentiable loss function, a lower mean squared error than that of the estimator in [15], and a more accurate Bahadur-Kiefer representation than the standard QR estimator. In this paper, in contrast to [2], we propose a renewable method for quantile regression via such a smoothed objective function.
The proposed method does not require any specific condition beyond those in [11], as long as each batch sample size is sufficiently large ($n_j \to \infty$, $j = 1, \ldots, b$).
Variable selection plays an important role in the model building process. In practice, it is common to have a large number of candidate predictor variables available, and a major challenge in regression analysis is to decide which of them should be included in the model. Several methods, including the least absolute shrinkage and selection operator (LASSO) [30], smoothly clipped absolute deviation (SCAD) [8], adaptive LASSO [42], and minimax concave penalty (MCP) [41], have been proposed to select variables and estimate their regression coefficients simultaneously. Several algorithms have been developed for variable selection with streaming data sets. [7] applied truncated stochastic gradient descent (SGD) to a linear model. [29] introduced a novel framework combining updated statistics and truncation techniques for variable selection in a linear model. These SGD algorithms and truncation techniques, however, are sensitive to the learning rate or step size and tend to select a set with larger cardinality in order to include all important variables. [5] considered a class of online estimators in a high-dimensional autoregressive model. [28] proposed an inference procedure for high-dimensional linear models via recursive online-score estimation. In both works, it is assumed that the entire data set is available at the initial stage for computing an initial estimator and that the information in the streaming data is used to reduce the bias of the initial estimator. However, the assumption that the full data set is available at the initial stage is not realistic for analyzing streaming data sets. [12] proposed an online debiased LASSO method for high-dimensional linear models with streaming data sets based on the least squares method. [25] studied a general framework for online updating variable selection in a generalized linear model with streaming data sets. In this paper, we also study a renewable variable selection method for quantile regression.
To summarize, we make the following contributions to the existing literature. (1) We develop a renewable estimator and algorithm for quantile regression that require only the current data batch in the data stream and sufficient statistics of the historical data at each stage of the analysis. Theoretically, the proposed estimator achieves optimal efficiency: its asymptotic covariance matrix is the same as that of the estimator with full data, without any additional condition. (2) We study a renewable optimized objective function for variable selection in quantile regression. The proposed method likewise requires only the current data batch and sufficient statistics of the historical data at each stage of the analysis. To obtain a numerical solution, we introduce an efficient algorithm for this optimization problem. Moreover, the proposed method can choose tuning parameters via a data-driven, online updating BIC criterion. Theoretically, the proposed estimator achieves the same consistency and oracle properties as the estimator based on the entire data set under some general conditions. (3) The proposed renewable methods are free of any constraint on the number of batches, which means that the new methods are adaptive to the situation where streaming data sets arrive fast and perpetually. Finally, Table 1 summarizes the differences between our proposed methods and the previous methods in the references.

Table 1
The differences between our proposed methods and the previous methods in the references.

Reference | Method | Additional conditions
[27] | Estimating equations | $b = O(n_j^c)$, $c < 1/3$
[2] | Quantile regression | $n_j$ approximately $n_{j-2}^2$
[36] | Quantile regression | $b/\sqrt{N_b} \to 0$
Ours | Quantile regression | None

The remainder of this paper is organized as follows. In Section 2, the renewable smoothed QR estimator is proposed. The renewable variable selection method is developed in Section 3. Simulation examples and applications to real data are given in Section 4 to illustrate the proposed procedures. We conclude with a brief discussion in Section 5. All technical proofs are provided in the Appendix.

Standard smoothing quantile regression
Suppose that the batches up to the $b$-th batch of streaming data can be pooled into one data set. Note that for quantile regression the loss function $\rho_s(r) = sr - rI(r < 0)$ is nondifferentiable. The QR estimator therefore has no closed-form expression, which makes it impossible to construct a renewable estimator for streaming data directly. To circumvent the nondifferentiability of the QR loss, the estimator of $\beta_0$ in model (1.1) can instead be obtained by minimizing the following smoothed quantile regression (SQR) objective function [11], in which the check loss is replaced by its convolution with a kernel, $\ell_h = \rho_s * K_h$ with $K_h(u) = h^{-1}K(u/h)$ for a bandwidth $h > 0$:
$$\hat{Q}_h(\beta) = \frac{1}{n}\sum_{i=1}^{n} \ell_h(Y_i - X_i^\top \beta). \qquad (2.1)$$
The objective $\hat{Q}_h(\cdot)$ is twice continuously differentiable, with gradient $U(D; \beta, h) = \nabla \hat{Q}_h(\beta)$ and Hessian matrix $J(D; \beta, h) = \nabla^2 \hat{Q}_h(\beta)$, where $D$ denotes the data set on which the objective is evaluated. $\qquad (2.2)$
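A minimal numerical sketch of this estimator (our illustrative code, not the paper's R implementation): for the standard normal kernel, the smoothed loss has the closed form $\ell_h(u) = u\{s - \Phi(-u/h)\} + h\,\phi(u/h)$, whose derivatives $\ell_h'(u) = s - \Phi(-u/h)$ and $\ell_h''(u) = \phi(u/h)/h$ give the gradient and Hessian in (2.2) directly, so Newton's method applies to the convex objective:

```python
import numpy as np
from scipy.stats import norm

def sqr_newton(X, y, s, h, tol=1e-8, max_iter=100):
    """Convolution-smoothed QR with a Gaussian kernel (illustrative sketch).

    Smoothed loss l_h(u) = u*(s - Phi(-u/h)) + h*phi(u/h) is convex and
    twice differentiable, so Newton's method applies directly.
    """
    n, _ = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # least-squares warm start
    for _ in range(max_iter):
        r = y - X @ beta
        grad = -X.T @ (s - norm.cdf(-r / h)) / n  # l_h'(u) = s - Phi(-u/h)
        w = norm.pdf(r / h) / h                   # l_h''(u) = phi(u/h)/h
        hess = (X * w[:, None]).T @ X / n
        step = np.linalg.solve(hess, grad)
        beta -= step
        if np.linalg.norm(step) < tol:
            break
    return beta
```

This is only a sketch; the conquer algorithm of [13] is the implementation used in the numerical studies.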

Smoothing quantile regression for streaming data sets
For model (1.1), $D_j = \{X_j, Y_j\}$ is the $j$-th data batch, where $Y_j = (Y_{1,j}, \ldots, Y_{n_j,j})^\top$ and $X_j = (X_{1,j}, \ldots, X_{n_j,j})^\top$. We suppose that the $(X_{i,j}, Y_{i,j})$ for all $i$ and $j$ are i.i.d. samples from $(Y, X)$. We begin with a simple scenario of two batches of data $D_1$ and $D_2$, where $D_2$ arrives after $D_1$. We want to update the initial SQR estimator $\hat{\beta}_1$ (or $\beta^*_1$), obtained by minimizing (2.1), to a renewed SQR estimator $\beta^*_2$ without using any subject-level data from $D_1$ but only some summary statistics. By (2.1) and (2.2), the SQR estimator $\hat{\beta}_1$ satisfies
$$U(D_1; \hat{\beta}_1, h_1) = 0, \qquad (2.3)$$
where $h_1$ is the bandwidth based on $n_1$, the sample size of $D_1$. The estimator $\beta^*_2$ based on both batches satisfies the following aggregated score equation:
$$U(D_1; \beta^*_2, h_1) + U(D_2; \beta^*_2, h_2) = 0. \qquad (2.4)$$
Solving Eq. (2.4) for $\beta^*_2$ involves the use of subject-level data in both $D_1$ and $D_2$, where $D_1$ may no longer be accessible. Our renewable estimation handles this issue. To derive a renewable estimate, by (A.10) in the Appendix we can obtain
$$\beta^*_2 - \hat{\beta}_1 = O_p(N_2^{-1/2}), \qquad (2.5)$$
where $O_p(\cdot)$ means bounded in probability. Taking the first-order Taylor series expansion of $U(D_1; \beta^*_2, h_1)$ around $\hat{\beta}_1$,
$$U(D_1; \beta^*_2, h_1) = U(D_1; \hat{\beta}_1, h_1) + J(D_1; \hat{\beta}_1, h_1)(\beta^*_2 - \hat{\beta}_1) + R_1. \qquad (2.6)$$
By (2.3), (2.5) and (2.6), we have
$$U(D_1; \beta^*_2, h_1) = J(D_1; \hat{\beta}_1, h_1)(\beta^*_2 - \hat{\beta}_1) + R_1. \qquad (2.7)$$
By placing (2.7) into (2.4), we obtain
$$J(D_1; \hat{\beta}_1, h_1)(\beta^*_2 - \hat{\beta}_1) + U(D_2; \beta^*_2, h_2) + R_1 = 0. \qquad (2.8)$$
When $n_1$ is sufficiently large, under some mild regularity conditions, both $\hat{\beta}_1$ and $\beta^*_2$ are consistent estimators of $\beta_0$. Moreover, taking sufficiently small bandwidths $h_1$ and $h_2$, the error term $R_1$ may be asymptotically ignored.

Removing this term, we propose a new estimator $\hat{\beta}_2$ as the solution to the equation
$$J(D_1; \hat{\beta}_1, h_1)(\hat{\beta}_2 - \hat{\beta}_1) + U(D_2; \hat{\beta}_2, h_2) = 0. \qquad (2.9)$$
Through Eq. (2.9), the initial estimator $\hat{\beta}_1$ is renewed to $\hat{\beta}_2$ using only historical summary statistics, namely the matrix $J(D_1; \hat{\beta}_1, h_1)$ and the estimate $\hat{\beta}_1$, instead of the subject-level raw data of $D_1$. Generalizing Eq. (2.9) to the streaming data sets $\{D_1, \ldots, D_b\}$, a renewable estimator $\hat{\beta}_b$ of $\beta_0$ is defined as the solution to the following incremental estimating equation:
$$\hat{J}_{b-1}(\hat{\beta}_b - \hat{\beta}_{b-1}) + U(D_b; \hat{\beta}_b, h_b) = 0, \qquad \hat{J}_{b-1} = \sum_{j=1}^{b-1} J(D_j; \hat{\beta}_j, h_j). \qquad (2.10)$$

Large sample properties
To establish the asymptotic properties of the proposed estimator, the following technical conditions are imposed.
C1. The kernel function $K(\cdot)$ is even, integrable with $\int K(u)\,du = 1$, and twice differentiable with bounded first and second derivatives. For example, the Gaussian kernel satisfies condition C1. Condition C2 is a regularity condition on the smoothness of the conditional density function $f(y \mid x)$. Conditions C2 and C3 ensure that the matrix $\Omega$ in Theorem 2.1 is positive definite, so that $\Omega^{-1}$ exists. Conditions C1-C3 are standard conditions commonly used in smoothed quantile regression, as in [11].
Theorem 2.1. Assume that conditions C1-C3 are satisfied. If $n_j \to \infty$ and the bandwidths satisfy $h_j \to 0$ with $h_j (N_j/\ln N_j)^{1/3} \to \infty$, $j = 1, \ldots, b$, then
$$\sqrt{N_b}\,(\hat{\beta}_b - \beta_0) \stackrel{L}{\longrightarrow} N\bigl(0,\; s(1-s)\,\Omega^{-1}\Sigma\,\Omega^{-1}\bigr),$$
where $\stackrel{L}{\longrightarrow}$ represents convergence in distribution, $\Omega = E\{f(X^\top\beta_0 \mid X)XX^\top\}$ and $\Sigma = E(XX^\top)$. From the result of Theorem 2.1, it is interesting to note that the renewable estimator $\hat{\beta}_b$ achieves optimal efficiency: its asymptotic covariance matrix is the same as that of the SQR estimator $\beta^*_b$ computed directly on all the samples, as shown in Theorem 5 in [11]. This implies that the proposed renewable estimator has the same asymptotic distribution as the SQR estimator.

Algorithm
Numerically, it is straightforward to find $\hat{\beta}_b$ from (2.10) using the Newton-Raphson method, iterating
$$\hat{\beta}_b^{(k+1)} = \hat{\beta}_b^{(k)} - \bigl\{\hat{J}_{b-1} + J(D_b; \hat{\beta}_b^{(k)}, h_b)\bigr\}^{-1}\bigl\{\hat{J}_{b-1}(\hat{\beta}_b^{(k)} - \hat{\beta}_{b-1}) + U(D_b; \hat{\beta}_b^{(k)}, h_b)\bigr\}. \qquad (2.12)$$
In algorithm (2.12), we clearly use only the subject-level data of the current batch $D_b$ and the summary statistics $\hat{J}_{b-1}$ and $\hat{\beta}_{b-1}$ from the historical data batches up to $b-1$, rather than the subject-level raw data of $\{D_1, \ldots, D_{b-1}\}$. Thus, our proposed renewable method is indeed an online estimation procedure. We summarize the general algorithm for the proposed renewable SQR method based on (2.12) as Algorithm 1. Note that in step 7 of Algorithm 1, we only need to save $\hat{\beta}_b$ and $\hat{J}_b$, which are $p \times 1$ and $p \times p$, respectively. The scale of the data to be stored is thus $(p+1)p$ instead of $N_b p$, where $N_b$ is the sample size of the streaming data sets up to $b$ batches. Because $p$ is assumed to be fixed in this paper, our method greatly reduces the amount of data storage.
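To make the update (2.12) concrete, the streaming procedure can be sketched as follows (our illustrative Python code with assumed names; the paper's Algorithm 1 is implemented in R). Only the latest estimate and the accumulated Hessian are kept across batches, never the raw data; scores and Hessians are accumulated as batch sums so that batches combine with their natural weights:

```python
import numpy as np
from scipy.stats import norm

class RenewableSQR:
    """Online smoothed-QR sketch: stores only (beta, J_cum), never raw data."""

    def __init__(self, p, s):
        self.s = s
        self.beta = np.zeros(p)
        self.J = np.zeros((p, p))   # accumulated batch Hessians
        self.N = 0                  # samples seen so far (for the bandwidth)

    def _score_hess(self, X, y, beta, h):
        r = y - X @ beta
        U = -X.T @ (self.s - norm.cdf(-r / h))    # batch score (sum form)
        w = norm.pdf(r / h) / h
        J = (X * w[:, None]).T @ X                # batch Hessian (sum form)
        return U, J

    def update(self, X, y, max_iter=50, tol=1e-8):
        self.N += len(y)
        h = (self.N * np.log(self.N)) ** (-1 / 4)  # paper's h_j, an assumption
        beta_prev = self.beta.copy()
        beta = beta_prev.copy()
        if np.count_nonzero(self.J) == 0:          # first batch: warm start
            beta = np.linalg.lstsq(X, y, rcond=None)[0]
        for _ in range(max_iter):
            U, Jb = self._score_hess(X, y, beta, h)
            g = self.J @ (beta - beta_prev) + U    # incremental equation (2.10)
            step = np.linalg.solve(self.J + Jb, g)
            beta -= step                           # Newton update (2.12)
            if np.linalg.norm(step) < tol:
                break
        _, Jb = self._score_hess(X, y, beta, h)
        self.J += Jb                               # renew the summary statistic
        self.beta = beta
        return beta
```

On simulated batches, the streamed estimate tracks the true coefficients without ever revisiting earlier batches.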

Variable selection based on all data
To avoid overfitting and improve generalization, we first consider the penalized SQR (PSQR) based on all data (the batches up to the $b$-th batch pooled into one data set):
$$\min_{\beta}\; \hat{Q}_h(\beta) + \sum_{j=1}^{p} p_\lambda(|\beta_j|), \qquad (3.1)$$
where $p_\lambda(\cdot)$ is a penalty function with regularization parameter $\lambda$. Among the various penalty functions, we adopt the SCAD penalty because of its properties of unbiasedness, sparsity and continuity. The SCAD penalty function is defined through its first-order derivative and its symmetry around the origin: for any $\theta > 0$,
$$p'_\lambda(\theta) = \lambda\Bigl\{ I(\theta \le \lambda) + \frac{(a\lambda - \theta)_+}{(a-1)\lambda}\, I(\theta > \lambda) \Bigr\},$$
where $a > 2$; $a = 3.7$ was suggested by [8] from a Bayesian perspective and is commonly used in the variable selection literature.
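The SCAD derivative translates directly into code. The following vectorized sketch (ours, for illustration) is the building block used later when the penalty is linearized around the current estimate:

```python
import numpy as np

def scad_deriv(theta, lam, a=3.7):
    """First derivative of the SCAD penalty for theta >= 0:
    p'(theta) = lam                          if theta <= lam
              = (a*lam - theta)_+ / (a - 1)  if theta  > lam
    """
    theta = np.asarray(theta, dtype=float)
    return np.where(theta <= lam,
                    lam,
                    np.maximum(a * lam - theta, 0.0) / (a - 1.0))
```

Note the derivative is constant ($\lambda$) for small coefficients, decays linearly, and vanishes beyond $a\lambda$, which is what makes SCAD unbiased for large coefficients.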

Variable selection based on streaming data sets
We begin with two batches of data $D_1$ and $D_2$. We want to update the initial PSQR estimator $\hat{\beta}_1$ (or $\beta^*_1$) to a renewed PSQR estimator $\beta^*_2$ without using any subject-level data from $D_1$, only some summary statistics.

The initial PSQR estimator is
$$\hat{\beta}_1 = \arg\min_{\beta}\; \hat{Q}_{h_1}(D_1; \beta) + \sum_{j=1}^{p} p_{\lambda_1}(|\beta_j|). \qquad (3.2)$$
Here, the PSQR estimator $\hat{\beta}_1$ also satisfies
$$U(D_1; \hat{\beta}_1, h_1) + p'_{\lambda_1}(|\hat{\beta}_1|)\,\mathrm{sign}(\hat{\beta}_1) = 0,$$
where $\mathrm{sign}(\cdot)$ is the sign function, applied componentwise. Then $\beta^*_2$ satisfies the following aggregated score equation:
$$U(D_1; \beta^*_2, h_1) + U(D_2; \beta^*_2, h_2) + p'_{\lambda_2}(|\beta^*_2|)\,\mathrm{sign}(\beta^*_2) = 0. \qquad (3.3)$$
By an analysis similar to that in Section 2.2 and (3.2), we can obtain
$$U(D_1; \beta^*_2, h_1) = J(D_1; \hat{\beta}_1, h_1)(\beta^*_2 - \hat{\beta}_1) - p'_{\lambda_1}(|\hat{\beta}_1|)\,\mathrm{sign}(\hat{\beta}_1) + R, \qquad (3.4)$$
where $R$ is an asymptotically negligible error term. Substituting (3.4) into (3.3), we have
$$J(D_1; \hat{\beta}_1, h_1)(\beta^*_2 - \hat{\beta}_1) - p'_{\lambda_1}(|\hat{\beta}_1|)\,\mathrm{sign}(\hat{\beta}_1) + U(D_2; \beta^*_2, h_2) + p'_{\lambda_2}(|\beta^*_2|)\,\mathrm{sign}(\beta^*_2) + R = 0.$$
Removing the asymptotically negligible term $R$, we propose a new estimator $\hat{\beta}_2$ as the solution to the equation
$$J(D_1; \hat{\beta}_1, h_1)(\hat{\beta}_2 - \hat{\beta}_1) - p'_{\lambda_1}(|\hat{\beta}_1|)\,\mathrm{sign}(\hat{\beta}_1) + U(D_2; \hat{\beta}_2, h_2) + p'_{\lambda_2}(|\hat{\beta}_2|)\,\mathrm{sign}(\hat{\beta}_2) = 0. \qquad (3.5)$$
Through Eq. (3.5), the initial estimator $\hat{\beta}_1$ is renewed to $\hat{\beta}_2$ using the statistics $J(D_1; \hat{\beta}_1, h_1)$, $\hat{\beta}_1$ and $\lambda_1$ instead of $D_1$. Generalizing the above procedure to the streaming data sets $\{D_1, \ldots, D_b\}$, a renewable penalized estimator $\hat{\beta}_b$ of $\beta_0$ is defined as the solution to the following incremental estimating equation:
$$\hat{J}_{b-1}(\hat{\beta}_b - \hat{\beta}_{b-1}) - \frac{N_{b-1}}{N_b}\, p'_{\lambda_{b-1}}(|\hat{\beta}_{b-1}|)\,\mathrm{sign}(\hat{\beta}_{b-1}) + U(D_b; \hat{\beta}_b, h_b) + p'_{\lambda_b}(|\hat{\beta}_b|)\,\mathrm{sign}(\hat{\beta}_b) = 0. \qquad (3.6)$$
The following theorems show the consistency and oracle properties of the estimator $\hat{\beta}_b$ in (3.6) or (3.7).
Theorem 3.1. (Consistency). Suppose that the conditions in Theorem 2.1 hold and $\lambda_j \to 0$ for $j = 1, \ldots, b$. Then $\|\hat{\beta}_b - \beta_0\| = O_p(N_b^{-1/2})$. Theorem 3.1 demonstrates that our estimator $\hat{\beta}_b$ is root-$N_b$ consistent. The following theorem shows that $\hat{\beta}_b$ has the oracle property proposed in [8]. Without loss of generality, we assume that the first $s$ elements of $\beta_0$ are nonzero and the last $p - s$ elements are zero; that is, $\beta_0$ can be written as $\beta_0 = (\beta_{0(1)}^\top, \beta_{0(2)}^\top)^\top$ with $\beta_{0(2)} = 0$. Correspondingly, write $\hat{\beta}_b = (\hat{\beta}_{b(1)}^\top, \hat{\beta}_{b(2)}^\top)^\top$, where $\hat{\beta}_{b(1)}$ and $\hat{\beta}_{b(2)}$ are the first $s$ and the last $p - s$ elements of $\hat{\beta}_b$, respectively. $\Omega_{(11)}$ and $\Sigma_{(11)}$ denote the top-left $s \times s$ submatrices of $\Omega$ and $\Sigma$, respectively.
Theorems 3.1 and 3.2 show that the renewable estimator $\hat{\beta}_b$ in (3.7) achieves the same consistency and oracle property as the estimator $\beta^*$ in (3.1), which is computed directly from all the samples ($n = N_b$), as shown in Lemmas 1 and 2 in the Appendix.
Remark 3.1. The proposed renewable variable selection in (3.6) differs from the method in [25] for generalized linear models. Specifically, the theoretical derivations of the two methods are different, and the term $(N_{b-1}/N_b)\, p'_{\lambda_{b-1}}(|\hat{\beta}_{b-1}|)\,\mathrm{sign}(\hat{\beta}_{b-1})$, which cannot be ignored, does not appear in [25].

Selection of regularization parameter
It is well known that the regularization parameter plays an important role in penalized methods. [33] verified that the SCAD penalized method with the regularization parameter selected by the Bayesian information criterion (BIC) can consistently identify the true model. Following [33], we use the BIC to choose the optimal values of the regularization parameters $\lambda$ in (3.1) and $\lambda_b$ in (3.7). Specifically, the BIC statistics are defined in terms of the smoothed check loss and the number of selected variables, and the regularization parameter minimizing the BIC is selected.

Algorithm
This section is devoted to the computational algorithm and numerical implementation. We focus on the algorithm for (3.7); the algorithm for (3.1) is similar. We describe a fast and easily implementable method based on the local adaptive majorization-minimization (LAMM) principle [9].
As discussed in [8], the SCAD penalty function is folded concave in $\beta$, which makes the penalized objective difficult to minimize. We apply the adaptive local linear approximation to the SCAD penalty [10] and approximately solve the resulting weighted-$\ell_1$ penalized problem, taking $\hat{\beta}_{b-1}$ as the initial estimator. As stated in [9], the majorization requirement only needs to hold locally at the current iterate rather than globally. To find the smallest $\phi$ such that the local majorization holds, the basic idea of LAMM is to start from a relatively small isotropic parameter $\phi = \phi_0 = 10^{-6}$ [9] and then successively inflate $\phi$ by a factor $\omega > 1$; we set $\omega = 10$ in the numerical studies. The isotropic form also admits a simple analytic solution to the subsequent majorized optimization problem.
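The LAMM iteration can be sketched as follows (illustrative code, ours; `grad_fn`/`loss_fn` stand for the gradient and value of the smoothed loss, and `lam_w` for the componentwise weights produced by the local linear approximation of SCAD). Because the majorizing quadratic is isotropic, the inner problem has the closed-form soft-thresholding solution:

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lamm_solve(grad_fn, loss_fn, beta0, lam_w, phi0=1e-6, omega=10.0,
               max_iter=500, tol=1e-8):
    """LAMM sketch for min L(beta) + sum_j lam_w[j]*|beta_j|, L smooth convex.

    Majorize L at the iterate by an isotropic quadratic with curvature phi,
    inflating phi by omega until the local majorization holds; the majorized
    problem is solved in closed form by soft-thresholding.
    """
    beta = beta0.copy()
    phi = phi0
    for _ in range(max_iter):
        g = grad_fn(beta)
        L = loss_fn(beta)
        while True:
            cand = soft_threshold(beta - g / phi, lam_w / phi)
            d = cand - beta
            # local majorization check: L(cand) <= L + g'd + (phi/2)||d||^2
            if loss_fn(cand) <= L + g @ d + 0.5 * phi * (d @ d):
                break
            phi *= omega
        if np.linalg.norm(cand - beta) < tol:
            return cand
        beta = cand
    return beta
```

As a sanity check, with $L(\beta) = \tfrac12\|\beta - z\|^2$ and unit weights the solver returns the soft-thresholded $z$, the known closed-form answer.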

Numerical studies
In this section, we first use Monte Carlo simulation studies to assess the finite sample performance of the proposed procedures and then demonstrate their application in two real data analyses. All programs are written in R. As mentioned in [13], the SQR method is insensitive to the choice of the bandwidth $h$. In view of Theorem 2.1, we take $h_j = (N_j \ln N_j)^{-1/4}$ for simplicity, $j = 1, \ldots, b$, and use the Gaussian kernel $K(\cdot)$ in all of the numerical experiments.
Some settings used in the simulations are as follows: $s$ is the quantile level, $p$ is the dimension of the covariates, $b$ is the number of batches, and $N_b$ is the total sample size; Cases 1 and 2 are two different error settings; RLS, SQR, OLEQR and RSQR are the estimation methods, and PLS, PQR, PSQR and RPSQR are the variable selection methods; MSE is the mean squared error of the coefficient estimates, $t$ is the computation time of an estimation method, C is the average proportion of nonzero coefficients correctly estimated to be nonzero, IC is the average proportion of zero coefficients incorrectly estimated to be nonzero, MAFE is the mean absolute fitting error, and MAPE is the mean absolute prediction error.

Simulation Example 1: renewable parameter estimation
In this section, we study the performance of the renewable SQR (RSQR) estimator proposed in Section 2. Furthermore, we include the following three competitors in our comparison: (1) the SQR estimator with full data, which can be obtained by the conquer algorithm in [13]; (2) the renewable least squares estimator (RLS) for the streaming data set, as given in [24]; and (3) the online linear estimator for the QR (OLEQR) for the streaming data set, as given in [2].
We generate data from the linear model $Y = X^\top \beta_0 + e - F_e^{-1}(s)$, where $\beta_0$ is a $p$-dimensional vector with all elements being one and $F_e^{-1}(s)$ is the $s$-quantile of the error $e$, subtracted to eliminate the influence of the quantile level. Two error distributions of $e$ are considered: a standard normal distribution ($N(0,1)$) and a $t$ distribution with 3 degrees of freedom ($t(3)$). In this section, the sample size of each batch is $n$, so the full data size is $N_b = nb$, and the two error cases above are considered. We only present the results for $s = 0.5$ for the QR because the computation times at different quantiles are similar. In terms of the computation times in Fig. 3, we note that (i) the computation times of all estimation methods are very small; and (ii) the computation times of the three methods are close at $p = 10$, with the computation time of the SQR being less than those of the RSQR and OLEQR.
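The data-generating step can be sketched as follows (our illustrative code; the function name is ours). Subtracting the $s$-quantile of the error makes the conditional $s$-quantile of $Y$ equal to $X^\top\beta_0$, so $\beta_0$ is the estimation target at every quantile level:

```python
import numpy as np
from scipy.stats import norm, t as t_dist

def gen_batch(n, p, s, err="normal", rng=None):
    """One simulated batch: Y = X' beta0 + e - F_e^{-1}(s), beta0 = (1,...,1)."""
    rng = np.random.default_rng() if rng is None else rng
    X = rng.normal(size=(n, p))
    beta0 = np.ones(p)
    if err == "normal":
        e, q = rng.normal(size=n), norm.ppf(s)       # Case 1: N(0, 1) errors
    else:
        e, q = rng.standard_t(3, size=n), t_dist.ppf(s, 3)  # Case 2: t(3)
    return X, X @ beta0 + e - q
```

By construction, the empirical $s$-quantile of the residuals $Y - X^\top\beta_0$ is close to zero for large $n$.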

Fixed N b with varying b
In this section, we fix the total sample size $N_b = 10^6$ and vary the number of batches $b = 10, 50, 100, 200, 500, 1000, 2000$ for $s = 0.1, 0.5$ and $0.9$. The simulation results are presented in Tables 2-5, and the following conclusions can be drawn. (1) In terms of the MSEs in Tables 2-4, we note that (i) all the estimators are close to the true value, since the MSEs are very small; and (ii) for any given number of batches $b$, dimension $p$ and quantile $s$, the MSEs of the RSQR are very close to those of the SQR and better than those of the OLEQR.
(2) In terms of the computation times in Table 5, we note the following. (i) All estimation methods are very fast. (ii) When the data volume is not particularly large, there is little difference in estimation time: OLEQR, a one-round smoothing quantile regression method, is in principle faster than our multi-round method RSQR, but the computation times of RSQR are very close to those of OLEQR because RSQR needs few iterations to achieve convergence. (iii) Once the data size is large, RSQR clearly saves time.

Simulation Example 2: renewable variable selection
In this section, we study the performance of the PSQR estimator and the renewable penalized SQR estimator (RPSQR) proposed in Section 3. Furthermore, we include the following two competitors in our comparison: (1) the penalized least squares (PLS) estimator with full data, obtained by the ''ncvreg" function with the ''SCAD" method in the R package ''ncvreg"; and (2) the penalized QR (PQR) estimator with full data, obtained by the ''rq" function with the ''SCAD" method in the R package ''quantreg".
To evaluate the performance of the four methods, we calculate the MSE as in simulation Example 1, the average proportion of nonzero coefficients correctly estimated to be nonzero (denoted C), and the average proportion of zero coefficients incorrectly estimated to be nonzero (denoted IC). Note that C = 1 and IC = 0 imply perfect recovery. We further study the computational efficiency of our proposed estimator using the computation time (in seconds). The simulation results for $s = 0.2, 0.5$ and $0.8$ are presented in Tables 6-10, based on 100 simulation replications. From Tables 6-10, the following conclusions can be drawn. (1) In terms of the MSEs in Tables 6-8, we note that (i) all the estimators are close to the true value, since the MSEs are very small; and (ii) for any given number of batches $b$, dimension $p$, error setting and quantile $s$, the MSEs of RPSQR are very close to those of the PSQR and better than those of the PQR.
(2) In terms of the ICs in Table 9, the performance of the proposed PSQR and RPSQR methods is very good, with ICs close to 0 under different batches, quantiles and errors. The ICs of the PLS and PQR are all zero and one, respectively, under different batches, quantiles and errors. The poor performance of the PQR may be because the PQR implemented via ''rq" with the ''SCAD" method should probably be regarded as experimental, as mentioned in the documentation of the ''quantreg" R package. Therefore, our proposed variable selection method PSQR is a good method for quantile regression: it runs fast and can accurately select important variables.
(3) All four methods select all true predictors in all settings, and thus we do not report C in the tables. (4) In terms of the computation times in Table 10, we note that (i) for any given number of batches $b$, quantile $s$ and error setting, the computation time of the PQR is the longest, as expected; moreover, the proposed PSQR and RPSQR methods are much faster to compute than the PQR. (ii) The computation times of PSQR and RPSQR are very close; the renewable method therefore does not add much computational complexity.

Real data Example 1: Beijing multisite air-quality data set
We apply the proposed RSQR method of Section 2 to the analysis of the Beijing multisite air-quality data set. The data set includes 420768 hourly air pollutant records from 12 nationally controlled air-quality monitoring sites. The air-quality data are from the Beijing Municipal Environmental Monitoring Center, and the meteorological data at each air-quality site were matched with the nearest weather station of the China Meteorological Administration. The time period is from March 1st, 2013 to February 28th, 2017. The data set was obtained from https://archive.ics.uci.edu/ml/datasets/Beijing+Multi-Site+Air-Quality+Data.
In this study, we use model (1.1) to explore the relationship between the PM2.5 concentration (μg/m³) and the seven variables in Table 11. Because the data come from 12 nationally controlled air-quality monitoring sites, we set the number of batches $b = 12$. Fig. 4 depicts the changes in the estimated coefficients for the Beijing multisite air-quality data using our proposed RSQR method for quantiles $s = 0.1$ to $0.9$. From Fig. 4, it is easy to see that the estimated coefficients of TEMP and PRES decrease as the quantile $s$ increases, while the other estimated coefficients increase with $s$. Furthermore, we evaluate the performance of the proposed RSQR estimator against the SQR, RLS and OLEQR based on the mean absolute fitting error
$$\mathrm{MAFE} = \frac{1}{n}\sum_{i=1}^{n} |Y_i - \hat{Y}_i|,$$
where $n = 388817$ is the total sample size (the number of records after deleting missing data), $Y_i$ is the observed PM2.5 value, and $\hat{Y}_i$ is the fitted value of $Y_i$ at quantile $s = 0.5$. The results are presented in Table 12. In terms of the MAFE, the performance of the SQR is the best, and the performance of the RSQR is very close to that of the SQR; the computation time of the RSQR is less than that of the SQR.

Real data Example 2: Year Prediction MSD data set
As an illustration, we now apply the proposed PSQR and RPSQR methodologies of Section 3 to the YearPredictionMSD data set, collected from the public UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/YearPredictionMSD). The data set is a freely available collection of audio features for contemporary popular music tracks ranging from 1922 to 2011. Approximately 515345 observations were recorded with 91 variables: the release year of the song, 12 average timbre variables and 78 timbre covariance variables. The research problem is to predict the release year of a song from its audio features.
In this study, model (1.1) is used to fit the data, with the release year of a song as the dependent variable $Y$ and the 12 average timbre and 78 timbre covariance variables as covariates. To evaluate the performance of our proposed methods (PSQR and RPSQR) in Section 3, we first calculate the mean absolute prediction error of the predictions at quantile $s = 0.5$. The first 500000 data points are used for estimation and the remaining 15345 data points for prediction. Therefore,
$$\mathrm{MAPE} = \frac{1}{\tilde{n}} \sum_{i=1}^{\tilde{n}} |Y_i - \hat{Y}_i|,$$
where $\hat{Y}_i$ is the predicted value of $Y_i$ and $\tilde{n} = 15345$. The MAPEs and their standard deviations, obtained by the bootstrap method [3], are presented in Table 13. Table 13 clearly shows that the penalized QR estimators (PQR, PSQR and RPSQR) outperform the penalized LS estimator, with smaller MAPEs. The performances of PQR, PSQR and RPSQR are very close because their asymptotic properties are the same; see Theorems 1 and 2 in [38] (PQR), Lemmas 1 and 2 (PSQR), and Theorems 3.1 and 3.2 (RPSQR). PQR and PSQR are slightly better than RPSQR because they directly use the full data. Moreover, by Theorems 3.1 and 3.2, our proposed method RPSQR is not influenced by the number of batches $b$, so the MAPEs of RPSQR for different $b$ are very close.
Furthermore, to illustrate the computational advantage of the proposed methods (PSQR and RPSQR), we also list their running times under different $b$ and quantiles $s$ in Table 14. The results show that the PSQR costs much less time than the PQR, and its computation time is close to that of the RPSQR. In addition, we study the number of selected variables (NSV) of our proposed methods. The results in Table 14 show that the SCAD produces a small model: the numbers of selected variables under different values of $b$ and quantile levels $s$ are all smaller than the full $p = 90$ variables. Moreover, for any given $b$, the number of selected variables decreases as the quantile level $s$ increases. The performance (MAPE, $t$ and NSV) of the RPSQR under different values of $b$ is close to that of the PSQR.

Discussion
In this article, we considered renewable parameter estimation and variable selection for quantile regression with streaming data sets. The method requires only the current data batch in the data stream and summary statistics of the historical data (the latest estimator, the cumulative Hessian matrix and, for variable selection, the latest regularization parameter) at each stage of the analysis. The scale of the data to be stored is (p + 1)p (or p² + p + 1 for variable selection) instead of N_b·p, where N_b is the sample size of the streaming data sets up to b batches. Because p is assumed to be fixed in this paper, our method greatly reduces the amount of data storage. Theoretically, the proposed estimators achieve optimal efficiency: their asymptotic covariance matrices are the same as those of the estimators computed on the full data. Moreover, the proposed renewable methods are free of any constraint on the number of batches, which means that they adapt to situations where streaming data sets arrive quickly and perpetually. As the proposed methods are all based on a convolution-type smoothing of the objective function, Algorithms 1-3 are fast and scalable.
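To make the updating idea concrete, the sketch below implements a Gaussian-kernel convolution-smoothed check loss and a renewable Newton-type update that carries only the current estimate and the cumulative Hessian between batches. This is our illustrative reading of the approach, not the paper's exact Algorithm 1; the bandwidth h, batch sizes and least-squares warm start are our own choices:

```python
import numpy as np
from math import erf

def _Phi(x):
    """Standard normal CDF, applied elementwise."""
    flat = [0.5 * (1.0 + erf(t / np.sqrt(2.0))) for t in np.ravel(x)]
    return np.array(flat).reshape(np.shape(x))

def _phi(x):
    """Standard normal density."""
    return np.exp(-0.5 * x * x) / np.sqrt(2.0 * np.pi)

def batch_score_hessian(X, y, beta, tau, h):
    """Score (gradient) and Hessian, summed over a batch, of the
    Gaussian-kernel convolution-smoothed quantile (check) loss."""
    r = y - X @ beta
    s = _Phi(r / h) - (1.0 - tau)     # smoothed version of 1{r > 0} - (1 - tau)
    grad = -X.T @ s
    H = (X * (_phi(r / h) / h)[:, None]).T @ X
    return grad, H

def renewable_update(beta_prev, J_prev, X, y, tau, h, n_newton=8):
    """One renewable step: solve J_prev @ (beta - beta_prev) + U_b(beta) = 0
    by Newton iterations, then fold this batch's Hessian into the running
    sum. Only (beta_prev, J_prev) survive between batches -- no raw data."""
    beta = beta_prev.copy()
    for _ in range(n_newton):
        grad, H = batch_score_hessian(X, y, beta, tau, h)
        eq = J_prev @ (beta - beta_prev) + grad
        beta = beta - np.linalg.solve(J_prev + H, eq)
    _, H = batch_score_hessian(X, y, beta, tau, h)
    return beta, J_prev + H

# Toy data stream: 10 batches of 500 observations, tau = 0.5, symmetric noise.
rng = np.random.default_rng(0)
p, tau, h = 3, 0.5, 1.0
beta_true = np.array([1.0, -2.0, 0.5])
beta, J = np.zeros(p), np.zeros((p, p))
for b in range(10):
    X = rng.normal(size=(500, p))
    y = X @ beta_true + rng.normal(size=500)
    if b == 0:
        beta = np.linalg.lstsq(X, y, rcond=None)[0]  # crude warm start
    beta, J = renewable_update(beta, J, X, y, tau, h)
print(np.round(beta, 2))   # close to beta_true
```

With τ = 0.5 and symmetric noise the smoothed estimating equation is unbiased at the true coefficients, so the streamed estimate tracks the full-data fit while each batch is discarded after its update.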
From the numerical studies in Section 4, we can see that our proposed estimators are very close to the estimators computed directly on the full data, and outperform existing methods in the literature by achieving smaller MSEs. The variable selection method effectively selects the important variables. The proposed methods run quickly, and are faster than the full-data estimators for large sample sizes and large dimensions.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix A

From equations (A.1), (A.2) and G_b(β̂_b) = 0, we obtain (A.4). By (2.9), we obtain (A.5); thus, combining (A.4) and (A.5) yields (A.6)-(A.8). By Lemmas 1 and 4 in [11], under the condition n_j → ∞, j = 1, ..., b, the matrices J(D_j; β̂_j) are consistent; under the conditions h_j(N_j / ln N_j)^{1/3} → ∞ and n_j → ∞, j = 1, ..., b, (A.9) holds, and by Lemma 1 and Theorem 5 in [11], (A.10) holds. Plugging (A.9) and (A.10) into equation (A.8) gives (A.11), and by Lemma 3 in [12], (A.12) follows.

Proof of Lemma 1. It is sufficient to show that for any given δ > 0, there exists a large enough constant C such that the penalized objective at β_0 + h/√n exceeds its value at β_0 uniformly over ||h||_2 = C with probability at least 1 − δ, where β_{0,j} and h_j denote the j-th components of β_0 and h, respectively. Given any fixed h, by Lemma 1 in [11] and the condition that the bandwidth h → 0, the first term of (A.13) is controlled, and by choosing a sufficiently large C, the second term of (A.13) dominates. Note that the SCAD penalty is flat for coefficients of magnitude larger than aλ; thus, by the condition λ → 0, for any given δ_1 > 0 the bound (A.16) holds. In fact, based on (A.13), (A.17) holds, where the last equality uses p'_λ(|β_{0,j}|) sign(β_{0,j}) = 0 and p''_λ(|β_{0,j}|) → 0, j = 1, ..., s, by the condition λ → 0. Thus, based on (A.16) and (A.17), part (ii) follows from Slutsky's theorem and the central limit theorem.
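The display elided in the proof of Lemma 1 is, in the standard Fan-Li local-minimizer argument, of the following form (our reconstruction; the paper's exact expression may differ in its normalization):

```latex
P\Big\{\inf_{\|h\|_2 = C} Q\big(\beta_0 + n^{-1/2} h\big) > Q(\beta_0)\Big\} \ \ge\ 1 - \delta ,
```

which implies that, with probability at least 1 − δ, a local minimizer β̂ of Q exists inside the ball {β_0 + n^{−1/2}h : ||h||_2 ≤ C}, giving ||β̂ − β_0||_2 = O_p(n^{−1/2}).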
Proof of Theorem 3.1. The renewable penalized estimator β̂_b also satisfies the corresponding first-order condition; then, by (A.7) and (A.10), the stated result follows.

When b is large, to speed up the calculation of (2.11), we may avoid updating J(D_b; ·) with β̂_{b−1}, which leads to the following updating algorithm: at each step, compute the update with bandwidth h_b; save β̂_b and Ĵ_b and release the data set D_b from the memory. Output: β̂_b for b = 2, 3, ...
where β_{01} is an s-dimensional vector of nonzero elements and β_{02} = 0 is a (p − s)-dimensional vector of zeros. Theorem 3.2 (Oracle property). Suppose that all conditions in Theorem 3.1 hold. If √N_b λ_b → ∞, then with probability tending to one, the root-N_b consistent local minimizer β̂_b = (β̂_{b(1)}ᵀ, β̂_{b(2)}ᵀ)ᵀ in Theorem 3.1 satisfies the following: (i) sparsity, β̂_{b(2)} = 0; and (ii) asymptotic normality of β̂_{b(1)}. The regularization parameter is selected by criterion (3.8), where β̂*_λ and β̂_{b,λ_b} are the penalized estimators of β_0 obtained from (3.1) and (3.7) given λ and λ_b, respectively, and df_λ and df_{λ_b} are the numbers of nonzero coefficients in β̂*_λ and β̂_{b,λ_b}, respectively.
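The SCAD penalty used throughout (Fan and Li, 2001) has a derivative that vanishes for coefficients larger than aλ, which is exactly the "flat" region invoked in the proofs. A small sketch of that derivative, with the conventional a = 3.7:

```python
def scad_deriv(t, lam, a=3.7):
    """Derivative p'_lam(t) of the SCAD penalty for t >= 0 (Fan & Li, 2001):
    equal to lam for t <= lam, then decays linearly as (a*lam - t)_+/(a - 1),
    and is exactly zero once t > a*lam -- no shrinkage of large coefficients."""
    if t <= lam:
        return lam
    return max(a * lam - t, 0.0) / (a - 1.0)

lam = 0.5
print(scad_deriv(0.2, lam))   # small coefficient: full shrinkage, 0.5
print(scad_deriv(3.0, lam))   # |t| > a*lam = 1.85: derivative is 0.0
```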

Then save β̂_b, J̃_b and λ_b, and release the data set D_b from the memory. Output: β̂_b for b = 2, 3, ... Note that in step 14 of Algorithm 3, we only need to save β̂_b, J̃_b and λ_b. The scale of the data to be stored is p² + p + 1 instead of N_b·p (the sample size of the streaming data sets up to b batches).
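The storage claim is easy to sanity-check: the saved state is a p-vector, a p × p matrix and one scalar, i.e. p² + p + 1 numbers, versus N_b·p entries for the raw design matrix. A throwaway illustration (the batch sizes below are hypothetical):

```python
import numpy as np

p, n_per_batch, b = 90, 50_000, 10   # hypothetical dimensions
N_b = n_per_batch * b                # total sample size up to batch b

beta_hat = np.zeros(p)          # latest estimate
J_tilde = np.zeros((p, p))      # cumulative Hessian
lam_b = 0.1                     # latest regularization parameter

stored = beta_hat.size + J_tilde.size + 1
print(stored)                   # p^2 + p + 1 = 8191
print(N_b * p)                  # raw-data scale: 45000000
```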
To evaluate the performance of the four methods, we calculate the mean squared error (MSE), ||β̂ − β_0||², and the computation time (in seconds). Simulation results are based on 100 replications.

4.1.1. Fixed n with varying b

In this section, we fix the sample size of each batch at n = 200 and vary the number of batches b = 100, ..., 1000 for τ = 0.3, 0.5 and 0.7. From Figs. 1-3, the following conclusions can be drawn: (1) in terms of the MSEs in Figs. 1 and 2, we note that (i) all the estimators are close to the true value because the MSEs are very small; and (ii) for any given number of batches b, dimension p, error distribution and quantile level τ, the figures show that the MSEs of the proposed estimator (RSQR) are very close to those of the SQR and better than those of the OLEQR.

Fig. 1. The mean MSEs under different batches b, quantile levels τ, methods and errors for simulation Example 1 with a fixed n and p = 10.

Fig. 3. The computation times (in seconds) under different batches b, dimensions p, methods and errors for simulation Example 1 with a fixed n and τ = 0.5.

Fig. The estimated coefficients of RSQR under different τ for real data Example 1.

Here Q̃_b(β) = (1/2)(β − β̂_{b−1})ᵀ J̃_{b−1}(β − β̂_{b−1}) + S_{h_b}(D_b; β) − N_{b−1}[p'_{λ_{b−1}}(|β̂_{b−1}|) sign(β̂_{b−1})]ᵀ(β − β̂_{b−1}) + N_b p_{λ_b}(|β|). Similar to the proof of Lemma 1, it is sufficient to show that for any given δ > 0, there exists a large enough constant C̃ such that

P( inf_{||h||_2 = C̃} Q̃_b(β_0 + h/√N_b) > Q̃_b(β_0) ) ≥ 1 − δ. (A.18)

This implies that with probability at least 1 − δ there exists a local minimizer satisfying ||β̂_b − β_0||_2 = O_p(N_b^{−1/2}). By Lemma 1, ||β̂_1 − β_0||_2 = O_p(n_1^{−1/2}); if ||β̂_j − β_0||_2 = O_p(N_j^{−1/2}) for j = 1, ..., b − 1, then, for ||h||_2 = C̃, Q̃_b is dominated by the quadratic term hᵀΣh/2 once C̃ is sufficiently large. Hence equation (A.18) is satisfied.

Proof of Theorem 3.2. (i) For any β̂_{(1)} satisfying ||β̂_{(1)} − β_{01}||_2 = O_p(N_b^{−1/2}), the penalty term dominates for large N_b. This completes the proof of part (i) of the theorem. (ii) From Theorem 3.1 and part (i), we know that β̂_{b(1)} is a root-N_b consistent local minimizer of Q̃_b, given ||β̂_j − β_0||_2 = O_p(N_j^{−1/2}) and the conditions h_j = o(N_j^{−1/4}), j = 1, ..., b.

Table 3
The means and standard deviations (in parentheses) of the MSEs (×100) under different batches b, methods and errors for simulation Example 1 with a fixed N_b and τ = 0.1.

Table 4
The means and standard deviations (in parentheses) of the MSEs (×100) under different batches b, methods and errors for simulation Example 1 with a fixed N_b and τ = 0.9.

Table 5
The mean computation times (in seconds) under different full-data sample sizes N_b, batches b, methods and errors for simulation Example 1 with a fixed p = 10 and τ = 0.5.

Table 6
The means and standard deviations (in parentheses) of the MSEs (×100) under different batches b, methods and errors for simulation Example 2 with a fixed n = 400 and τ = 0.5.

Table 7
The means and standard deviations (in parentheses) of the MSEs (×100) under different batches b, methods and errors for simulation Example 2 with a fixed n = 400 and τ = 0.2.

Table 8
The means and standard deviations (in parentheses) of the MSEs (×100) under different batches b, methods and errors for simulation Example 2 with a fixed n = 400 and τ = 0.8.

Table 9
The mean C (×100) of the PSQR and RPSQR estimators under different τ, b and errors for simulation Example 2.

Table 10
The mean computation times (in seconds) of the PQR, PSQR and RPSQR estimators with τ = 0.5 under different b and errors for simulation Example 2.

Table 13
The MAPEs and standard deviations (in parentheses) of the PLS, PQR, PSQR and RPSQR estimators with τ = 0.5 under different b for real data Example 2.

Table 11
Covariates and their descriptions for real data Example 1.

Table 12
The MAFEs and computation times (in seconds) with τ = 0.5 of the RLS, SQR, OLEQR and RSQR estimators for real data Example 1.