.. _helpful-code-snippets:

Helpful Code Snippets
=====================

Unlike notebooks, these are not self-contained examples, but are meant to be a quick reference for specific tasks.
These are not necessarily exhaustive examples; see full class documentation for all possible arguments and options.
For a more full description of the SEPIA work flow, see :ref:`workflow`.

:ref:`SepiaData inputs`

:ref:`SepiaData operations`

:ref:`Model setup`

:ref:`Customize and run MCMC`

:ref:`Extract MCMC samples`

:ref:`Make predictions`

:ref:`Hierarchical or shared theta models`


SepiaData inputs
----------------

:ref:`sepiadata` objects are used to hold various types of data inputs to a model.
The types and sizes of each input to `SepiaData` helps determine which kind of model is to be set up.

A SEPIA model may contain only simulation data (an emulator-only model) or both simulation and observed data.

It is always good to call `print(data)` on your `SepiaData` object to verify your setup is as intended.

Examples of data setup for different kinds of SEPIA models (see :ref:`sepiadata` for fuller explanation of inputs):

Univariate-output emulator-only data
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
::

    data = SepiaData(t_sim=t, y_sim=y)           # No controllable input
    data = SepiaData(x_sim=x, y_sim=y)           # Only controllable input
    data = SepiaData(x_sim=x, t_sim=t, y_sim=y)  # Controllable and other inputs
    # Indicate that third column of x is categorical with 5 categories (takes values in [1, 2, 3, 4, 5])
    data = SepiaData(x_sim=x, y_sim=y, x_cat_ind=[0, 0, 5])

Multivariate-output emulator-only data
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
::

    data = SepiaData(t_sim=t, y_sim=y, y_ind_sim=y_ind)           # No controllable input
    data = SepiaData(x_sim=x, y_sim=y, y_ind_sim=y_ind)           # Only controllable input
    data = SepiaData(x_sim=x, t_sim=t, y_sim=y, y_ind_sim=y_ind)  # Controllable and other inputs

Univariate-output simulation and observed data
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
::

    data = SepiaData(t_sim=t, y_sim=y, y_obs=y_obs)                        # No controllable input
    data = SepiaData(x_sim=x, t_sim=t, y_sim=y, x_obs=x_obs, y_obs=y_obs)  # Controllable and other inputs

Multivariate-output simulation and observed data
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
::

    data = SepiaData(t_sim=t, y_sim=y, y_ind_sim=y_ind,
                     y_obs=y_obs, y_ind_obs=y_ind_obs)               # No controllable input
    data = SepiaData(x_sim=x, t_sim=t, y_sim=y, y_ind_sim=y_ind,
                     x_obs=x_obs, y_obs=y_obs, y_ind_obs=y_ind_obs)  # Controllable and other inputs

Multivariate-output simulation and observed data with ragged observations
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Ragged observations means the observed data indices vary across observation data instances.
In this case, `y_obs` and `y_ind_obs` are now lists instead of numpy arrays::

    y_obs = [np.array([[0.3, 0.5, 0.7]]), np.array([[0.1, 0.4, 0.6, 0.9]])
    y_ind_obs = [np.array([1, 2, 3]), np.array([0.5, 2.5, 4, 6])]
    data = SepiaData(t_sim=t, y_sim=y, y_ind_sim=y_ind, y_obs=y_obs, y_ind_obs=y_ind_obs) # No controllable input

SepiaData operations
--------------------

Regardless of the inputs given to `SepiaData`, there are a few key methods which generally should be called before
setting up the model.

Transformations
^^^^^^^^^^^^^^^

First, we want to transform `x` and `t` inputs to the unit hypercube::

    data.transform_xt()

Next, we want to standardize the `y` outputs::

    data.standardize_y()

Basis functions
^^^^^^^^^^^^^^^

If the outputs are multivariate, we want to set up a principal component (PC) basis and optionally, a discrepancy basis::

    # PC basis
    data.create_K_basis(n_pc=5)     # With 5 PCs
    data.create_K_basis(n_pc=0.99)  # Enough PCs for at least 99 pct variance explained

    # Discrepancy basis -- optional
    data.create_D_basis(D_type='linear')  # Default linear discrepancy
    data.create_D_basis(D=D)              # Pass in custom D basis

Set up model
------------

Once the data structure is set up correctly, the inputs are in the unit hypercube, the outputs are standardized,
and basis vectors are created (for multivariate output), we are ready to set up the Sepia model::

    model = SepiaModel(data)

The model parses the `SepiaData` structure to understand what kind of model is being set up and does a lot of
precomputation of various quantities to prepare for likelihood evaluations.
It also sets up default priors, MCMC step types and step sizes, and default starting values for MCMC.

To see information about the default setup, you can use::

    model.print_prior_info()
    model.print_value_info()
    model.print_mcmc_info()


Customize and run MCMC
----------------------

After instantiating the `SepiaModel` object, various aspects of the MCMC can be customized prior to doing MCMC.

Custom start values
^^^^^^^^^^^^^^^^^^^

Each parameter in the model has attribute `val` which holds the start values (or, during MCMC, the current values).
Prior to running MCMC, these can be directly modified using the `set_val` method::

    # Single scalar applies to all thetas
    model.params.theta.set_val(0.7)
    # Or pass an array of shape model.params.theta.val_shape
    model.params.theta.set_val(np.array([[0.7, 0.5, 0.1]]))


Fixing subsets of parameters
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It may sometimes be desirable to fix the values of certain parameters.
The `fixed` attribute of `SepiaParam` is a boolean array of size `val_shape` (all `False` by default)::

    model.params.lamWOs.fixed = np.array([[True, False]])

Change prior distribution and prior parameters
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Currently, there are only four distributions supported for priors: Normal, Gamma, Beta, and Uniform.
After instantiating the `SepiaModel`, we can modify priors as follows::

    prior_dist_name = 'Normal'
    prior_mu = 0.1
    prior_sd = 2.0
    prior_bounds = [0, 1]
    model.params.theta.prior = SepiaPrior(model.params.theta, dist=prior_dist_name, params=[prior_mu, prior_sd],
                                          bounds=prior_bounds)


Change MCMC step sizes or step types
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can manually change MCMC step types or step sizes::

    model.params.theta.mcmc.stepType = 'Uniform'
    model.params.theta.mcmc.stepParam = np.array([[0.5, 0.1, 0.3]])


Automatic MCMC step size tuning
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Automatic step size tuning based on `YADAS`_::

    model.tune_step_sizes(n_burn, n_levels)

.. _YADAS: https://arxiv.org/abs/1103.5986

Run MCMC or add more samples
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To do MCMC sampling, call `do_mcmc`::

    model.do_mcmc(500)

To append more samples to the current samples, you can call it again::

    model.do_mcmc(500) # Now has 1000 total samples


Saving MCMC chains or model info
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We will build some functions to handle this more smoothly, but for now you could do something like::

    import pickle
    for chunk in range(10):
        model.do_mcmc(500)
        with open('samples%d.pkl' % chunk, 'wb') as f:
            pickle.dump(model.get_samples(numsamples=500), f)

See also `SepiaModel.save_model_info()` and `SepiaModel.restore_model_info()`.

Extract MCMC samples
--------------------

To extract MCMC samples to a dictionary format::

    # Select a fixed set of samples
    model.get_samples(nburn=0, sampleset=np.arange(100), flat=True, includelogpost=True)

    # Select a fixed number of samples
    model.get_samples(nburn=0, numsamples=200, flat=True, includelogpost=True)

    # Discarding nburn samples
    model.get_samples(nburn=50, numsamples=200, flat=True, includelogpost=True)

    # Keep samples in array format rather than flattening along parameter dimensions
    model.get_samples(nburn=50, numsamples=200, flat=False)

    # Returns only a set of "effective samples" determined by effective sample size
    samples = model.get_samples(effectivesamples=True)

MCMC diagnostics
----------------

Several graphical diagnostics are available::

    plot_acf(model, nlags=30) # Autocorrelation function and effective sample size calculation
    mcmc_trace(samples)       # Trace plots
    theta_pairs(samples)      # Pairs plots of theta variables
    rho_box_plots(model)      # Box plots of GP lengthscale parameters
    param_stats(samples)      # Summary statistics of parameters

Each returns a `matplotlib` figure object that you can save using `plt.savefig()` or show using `plt.show()`.

Make predictions
----------------

To make predictions, use the :ref:`sepiapredict` class.
There are different types of predictions, and predictions can be made
in the model space (`w, u, v`) or output space (`y`), with or without standardization.

Emulator predictions
^^^^^^^^^^^^^^^^^^^^

To predict from the emulator (eta portion of model), first set up the `SepiaEmulatorPrediction` object::

    # Provide input settings to predict at
    x_pred = np.linspace(0,1,9).reshape((9,1))
    t_pred = np.tile(np.array([1,0,1]).reshape(1,3),(9,1))
    pred_samples = model.get_samples(numsamples=10)
    pred = SepiaEmulatorPrediction(x_pred=x_pred, samples=pred_samples, model=model, t_pred=t_pred)

To get w::

    predw = pred.get_w()

To get y on the standardized scale::

    predystd = pred.get_y(std=True)

To get y on the native (original) scale::

    predystd = pred.get_y()


Full model predictions
^^^^^^^^^^^^^^^^^^^^^^

To predict from full model (including observation noise, and discrepancy, if applicable)::

    x_pred = np.linspace(0,1,9).reshape(9,1)
    pred_samples = model.get_samples(numsamples=7)
    pred = SepiaFullPrediction(x_pred, pred_samples, model)

To get u, v::

    u_pred, v_pred = pred.get_u_v()

To get discrepancy::

    preddstd = pred.get_discrepancy(std=True) # Standardized scale
    predd = pred.get_discrepancy()            # Native/original scale

To get simulated y::

    predysimstd = pred.get_ysim(std=True) # Standardized scale
    predysim = pred.get_ysim()            # Native/original scale

To get y (simulator+discrepancy)::

    predy = pred.get_yobs()                            # Native/original scale
    predystd = pred.get_yobs(std=True)                 # Standardized scale
    predystdobs = pred.get_yobs(std=True, as_obs=True) # Standardized scale at only observed data locations x_obs
    
Cross-validation
^^^^^^^^^^^^^^^^

By default, leave-one-out cross validation is done on the emulator model::

    pred_samples = model.get_samples(numsamples=10)
    CV_pred = SepiaXvalEmulatorPrediction(samples=pred_samples, model=model)

    CV_pred_y = CV_pred.get_y()

You can also provide custom sets of indices to leave out in turn, such as leaving out chunks of 10 examples at a time,
and you can add residual variance to the predictions::

    leave_out_inds = np.array_split(np.arange(m), 5)
    pred_samples = model.get_samples(numsamples=7)
    CV_pred = SepiaXvalEmulatorPrediction(samples=pred_samples, model=model, leave_out_inds=leave_out_inds, addResidVar=True)


Sensitivity analysis
--------------------

Run sensitivity analysis::

    model.do_mcmc(1000)
    samples = model.get_samples(20)
    sens = sensitivity(model, samples)

Hierarchical or shared theta models
-----------------------------------

The syntax for both cases is similar. First, set up each model, then put them in a list::

    m1 = SepiaModel(d1)
    m2 = SepiaModel(d2)
    m3 = SepiaModel(d3)
    model_list = [m1, m2, m3]

Then we need to specify which thetas are shared or modeled hierarchically. The way to do this is with a numpy array
of size `(j, n_models)` where each row represents one of the shared/hierarchical theta variables,
and each column gives the index of the shared/hierarchical theta in the respective model. For instance::

    theta_inds = np.array([[0, 0, 0], [1, 1, 2], [-1, 2, 3]])

This means that the first shared/hierarchical theta is `theta_0` in model 1, `theta_0` in model 2, and `theta_0` in model 3.
The second shared/hierarchical theta is `theta_1` in model 1, `theta_1` in model 2, and `theta_2` in model 3.
The third shared/hierarchical theta is *not* in model 1, is `theta_2` in model 2, and is `theta_3` in model 3.
The index -1 is used to indicate that a particular shared/hierarchical theta is not in a particular model.

Then the model setup is::

    shared_model = SepiaSharedThetaModels(model_list, theta_inds)     # Shared version
    hier_model = SepiaHierarchicalThetaModels(model_list, theta_inds) # Hierarchical version

MCMC is done similarly to regular models::

    shared_model.do_mcmc()

Step size tuning is not supported on shared or hierarchical theta models.