Histograms

  1. Basic principles
  2. Basic output format
  3. Matplotlib output format
  4. The methods
The Hist class gives a simple implementation of one-dimensional histograms, useful for quick-and-dirty testing, without the need to link to more sophisticated packages. For this reason it is used in many of the sample main programs found in the examples subdirectory.

Basic principles

We here provide a simple overview of what is involved. As a first step you need to declare a histogram, with name, title, number of bins and x range (from, to).
 
   Hist ZpT( "Z0 pT spectrum", 100, 0., 100.); 
Alternatively you can first declare it and later define it:
 
   Hist ZpT; 
   ZpT.book( "Z0 pT spectrum", 100, 0., 100.); 
Once declared, its contents can be added by repeated calls to fill,
 
   ZpT.fill( 22.7, 1.); 
where the first argument is the x value and the second the weight. Since the weight defaults to 1 the last argument could have been omitted in this case.

A set of overloaded operators have been defined, so that histograms can be added, subtracted, divided or multiplied by each other. Then the contents (and associated statistical errors) are modified accordingly bin by bin. Thus the relative deviation between two histograms data and theory can be found as

 
  diff = (data - theory) / (data + theory); 
assuming that diff, data and theory have been booked with the same number of bins and x range. That responsibility rests on the user; some checks are made for compatibility, but not enough to catch all possible mistakes.

Also overloaded operations with double real numbers are available. Again these four operations are defined bin by bin, i.e. the corresponding amount is added to, subtracted from, multiplied by or divided by each bin. The double number can come before or after the histograms, with obvious results. Thus the inverse of a histogram result is given by 1. / result. The two kind of operations can be combined, e.g.

 
  allpT = ZpT + 2. * WpT 
Finally, also the +=, -+, *=, /= are overloaded, with the right-hand side being either a histogram or a real number.

Basic output format

A histogram can be printed by making use of the overloaded << operator, e.g.:
 
   cout << ZpT; 
The printout format is inspired by the old HBOOK one. To understand how to read this format, consider the simplified example
 
        3.50*10^ 2  9 
        3.00*10^ 2  X   7 
        2.50*10^ 2  X  1X 
        2.00*10^ 2  X6 XX 
        1.50*10^ 2  XX5XX 
        1.00*10^ 2  XXXXX 
        0.50*10^ 2  XXXXX 
 
          Contents 
            *10^ 2  31122 
            *10^ 1  47208 
            *10^ 0  79373 
 
          Low edge  -- 
            *10^ 1  10001 
            *10^ 0  05050 
The key feature is that the Contents and Low edge have to be read vertically. For instance, the first bin has the contents 3 * 10^2 + 4 * 10^1 + 7 * 10^0 = 347. Correspondingly, the other bins have contents 179, 123, 207 and 283. The first bin stretches from -(1 * 10^1 + 0 * 10^0) = -10 to the beginning of the second bin, at -(0 * 10^1 + 5 * 10^0) = -5.

The visual representation above the contents give a simple impression of the shape. An X means that the contents are filled up to this level, a digit in the topmost row the fraction to which the last level is filled. So the 9 of the first column indicates this bin is filled 9/10 of the way from 3.00*10^2 = 300 to 3.50*10^2 = 350, i.e. somewhere close to 345, or more precisely in the range 342.5 to 347.5.

The printout also provides some other information, such as the number of entries, i.e. how many times the histogram has been filled (including both underflow and overflow), the total weight inside the histogram (excluding underflow and overflow), the total weight in underflow and overflow, and the statistical power of the contents inside the histogram expressed in terms of nEffective = ( sum ( w ) )^2 / ( sum w^2 ). For an unweighted histogram, the latter is equal to the number of entries (excluding underflow and overflow), while for one filled with varying weights, it is less than that.

Also printed are the mean value, median, and root-mean-square width (RMS). (Further moments about the mean can be computed via dedicated getter methods, see below.) If doStats == true then statistical errors are also printed, summed in quadrature with an estimate of the error due to the finite bin granularity (calculated as the difference with respect to unbinned estimates of the same quantities).

In the printout, the moments and the median all disregard underflow and overflow and assume that all the contents is in the middle of the respective bin (with the median using a linear interpolation inside the median bin). This is especially relevant when you plot an integer quantity, such as a multiplicity. Then it makes sense to book with limits that are half-integers, e.g.

 
   Hist multMPI( "number of multiparton interactions", 20, -0.5, 19.5); 
so that the bins are centered at 0, 1, 2, ..., respectively. This also avoids ambiguities which bin gets to be filled if entries are exactly at the border between two bins. Also note that the fill( xValue) method automatically performs a cast to double precision where necessary, i.e. xValue can be an integer.

Matplotlib output format

Assuming you have Python installed on your platform, it is possible to generate simple Matplotlib/Pyplot Python code from the histograms generated above, which then can be run to produce PDF plots. This should be done near the end of a run, after the histograms have been filled and properly normalized, as an alternative or complement to the basic output format above.

In a first step you must then decide on the name of the Python program, e.g.:

 
  HistPlot hpl( "bosonpT"); 
where file ending .py is added automatically.

For each new frame the name should be given, which will later give rise to a PDF file, with ending .pdf added automatically. If you leave the name field empty the same file will be used as for the latest named one, i.e. producing several frames in one file. One can optionally also give title and x and y axis labels:

 
  hpl.frame( "pTdist", "Boson pT distributions", "pT (GeV)", "sigma"); 
Next, existing Hist histograms can be added to the frame, one by one:
 
  hpl.add( ZpT, "-"); 
  hpl.add( WpT, "--,indigo"); 
  hpl.add( ZpT + 2. * WpT , "", "pT spectrum of Z, W+ and W-"); 
where the second argument tells how each histogram will be plotted. Default is histogram style, "h", but the values can also be connected with full lines "-", dashed ones "--", or dash-dotted ones "-.", or plotted as points "." or crosses "x", to mention some of the many options offered by Pyplot. As alternative to "h", the argument "e" can be specified which provides the same style as "h" but with error bars included. You can also specify the colour, separated by a comma from the line style, to override the normal colour cycle. The most common colours can be given just as a single letter, such as "r", "g", "b", but a more extensive colour palette allow finetuning to nuances such as "orange", "gold", "darkgreen", "royalblue", "crimson", and so on. A third argument can set the legend of each histogram; by default it is taken as the title of histogram.
Finally the plot code itself will be set up by
 
  hpl.plot(); 
where optionally it is possible to demand a logarithmic y scale.

The frame - add - plot steps can be repeated as needed, each giving rise to a separate PDF file with a plot. In case a plot is to be generated from a single histogram the three steps can be joined into one

 
  hpl.plotFrame( "onlyZ", ZpT ); 
where only the name of the PDF file and the histogram are compulsory, while further arguments as discussed above are optional.

At the end, a file bosonpT.py has now been generated with the proper plotting commands. Additionally a data file has been generated for each histogram to be plotted, pTdist-0.dat, pTdist-1.dat, etc. Now doing

 
  python bosonpT.py 
in a terminal window will produce the plots, such as pTdist.pdf and onlyZ.pdf. You may of course edit the python file further to improve on the finer details.

Examples are provided in main03.cc and main07.cc, where the latter is the simpler one. The main51.cc example illustrates that the x scale can be chosen logarithmically, by using an optional last argument when histograms are booked.

Alternatively, it also straight forward to produce histograms direcly through the Python interface using the binEdges and binContents methods of the Hist class.

 
  import pythia8, random 
  from matplotlib import pyplot 
  hst = pythia8.Hist("log", 100, 0.1, 4, True) 
  for i in range(0, int(1e6)): hst.Fill(random.gauss(0, 1)) 
  pyplot.hist(hst.getBinEdges()[:-1], hst.getBinEdges(), 
               weights = hst.getBinContents(), 
               histtype = "step", label = hst.getTitle()) 

The methods

We here collect a more complete and formal overview of the methods.

Hist::Hist()  
declare a histogram, but does not define it.

Hist::Hist(string title, int numberOfBins, double xMin, double xMax, bool logX = false, bool doStats = false)  
declare and define a histogram, where
argument title : is a string with the title of the histogram at output,
argument numberOfBins : is the number of bin the x range will be subdivided into, limited to be at most 1000,
argument xMin : is the lower edge of the histogram,
argument xMax : is the upper edge of the histogram,
argument logX : gives a logarithmic x scale between xMin and xMax if set true. Note that then xMin must be above 1e-20 so as to stay with positive numbers.
argument doStats : decides whether statistics allowing unbinned higher moments to be calculated are saved (up to n = 7) and also whether the histogram summary printout via the << operator includes +/- uncertainty estimates on the Mean, Median, and RMS values.

Hist::Hist(const Hist& h)  
creates an identical copy of the histogram in the argument, including bin contents.

Hist::Hist(string title, const Hist& h)  
creates an identical copy of the histogram in the argument, including bin contents, except that a new title is provided as first argument.

Hist& Hist::operator=(const Hist& h)  
copies all properties of the histogram in the argument, except that the original histogram title is retained.

void Hist::book(string title, int numberOfBins, double xMin, double xMax, bool logXIn = false, bool doStatsIn = false)  
define a histogram that previously was only declared; see above for the meaning of the arguments.

static Hist Hist::plotFunc(function<double(double)> f, string title, int nBin, double xMin, double xMax, bool logX = false)  
create a histogram out of a provided function f(x). The function is evaluated in the middle of each bin.

void Hist::title(string title)  
change the title of a histogram, but keep other properties unchanged.

void Hist::null()  
reset bin contents, but keep other histogram properties unchanged.

void Hist::fill(double xValue, double weight)  
fill the histogram, where
argument xValue : is the x position where the filling should occur, and
argument weight (default = 1.) : is the amount of weight to be added at this x value.

friend ostream& operator<<(ostream& os, const Hist& h)  
appends a simple histogram printout (see above for format) to the ostream, while leaving the histogram object itself unchanged. At most 100 columns are allowed to be displayed. If the number of bins is larger than 100 then the contents of adjacent bins are added to give the value in each column. (Two by two up to 200 bins, three by three up to 300, and so on, with the very last column possibly summing fewer rows than the others.)
If the histogram has been booked with logarithmic x scale then the log10 of each lower bin border will be displayed rather than the border itself, to avoid undue complication. The Pyplot interface will display correct values, however.

void Hist::table(ostream& os = cout, bool printOverUnder = false, bool xMidBin = true, bool printError = true)  
void Hist::table(string fileName, bool printOverUnder = false, bool xMidBin = true, bool printError = true)  
print a three-column table, where the first column gives the center of each bin, the second one the corresponding bin contents, and the bin errors. The table may be useful for plotting e.g. with Gnuplot.
The desired output stream or file name can be provided as argument. The former is more flexible (e.g., it allows easy append to an existing file), whereas the latter is simpler for the case that each histogram should be a file of its own.
An optional printOverUnder = true argument allows also underflow and overflow contents to be printed. (The arbitrary x coordinates for these are placed as if corresponding to same-size bins just below or above the regular histogram bins.)
An optional xMidBin = false argument will have the x value at the beginning of each bin printed, rather than the default midpoint value. Finally, the optional printError = false will force the values in the third column to be zero.

void Hist::rivetTable(ostream& os = cout, bool printError = true)  
void Hist::rivetTable(string fileName, bool printError = true)  
print a five-column table, where the first two columns give the lower and upper borders of each bin, the third one the bin contents, and the fourth and fifth the error (up and down) associated with the contents. This format matches the one that Rivet uses for its histograms. The choice between the two methods is the same as above for the table methods.
With the optional printError = false the errors will be output as zero.

void Hist::pyplotTable(ostream& os = cout, bool isHist = true)  
void Hist:pyplotTable(string fileName, bool isHist = true)  
prints either a two- or a three-column table, depending on whether the contents are to be plotted as a curve or as a histogram, the latter being default. In either case the first column gives the center of each bin and the second one the corresponding bin contents. For a histogram the third column gives the lower border of each bin. Then also a final line is provided, that gives no further histogram contents but provides the upper border of the last bin. This format is used for plotting with Pyplot, see below. The choice between the two methods is the same as above for the table methods.

void Hist::fillTable(ostream& os = cout)  
void Hist::fillTable(string fileName)  
fill the contents of a two-column table into the histogram. The first column should give the center of each bin and the second one the corresponding bin contents to be filled. This is basically the inverse of the table methods above, whereby a written histogram contents can be read back in. Over- and underflow will be read back in if provided, but the xMidBin = false option will not work.

friend void table(const Hist& h1, const Hist& h2, ostream& os = cout, bool printOverUnder = false, bool xMidBin = true)  
friend void table(const Hist& h1, const Hist& h2, string fileName, bool printOverUnder = false, bool xMidBin = true)  
print a three-column table, where the first column gives the center of each bin and the second and third ones the corresponding bin contents of the two histograms. Only works if the two histograms have the same x axis (within a tiny tolerance), else nothing will be done. The optional last two arguments allows also underflow and overflow contents to be printed, and the x to refer to the beginning of the bin rather than the center; see above.

string Hist::getTitle()  
return the title of the histogram.

int Hist::getBinNumber()  
return the number of bins in the histogram.

int Hist::getNonFinite()  
return the number of entries that are neither infinite nor NaN.

bool Hist::getLinX()  
return true if the histogram has a linear x scale and false if a logarithmic one.

bool Hist::getXMin()  
bool Hist::getXMax()  
return the lower and upper edge of the booked x range.

bool Hist::getYMin()  
bool Hist::getYMax()  
return the highest and lowest value of any bin inside the histogram, i.e. excluding underflow and overflow.

double Hist::getYAbsMin()  
returns the smallest absolute value (above 1e-20, to sidestep bins with zero content) in any of the bins inside the histogram.

double getXMean(bool unbinned=true)  
double getXMeanErr(bool unbinned=true)  
Return mean in X and error on mean, either unbinned from saved weight sums (default) or explicitly from the histogram bins (unbinned = false). In the latter case, the error estimate includes a measure of the error due to the finite bin granularity, calculated as the difference between the binned and unbinned values and summed in quadrature with the statistical error.

double getXMedian(bool includeOverUnder=true)  
double getXMedianErr(bool unbinned=true)  
Return median in X and its statistical error, ignoring underflow and overflow (default) or including them (includeOverUnder = true). By default, the reported error includes the same granularity estimate as for the mean value (and also computed using the mean, not the median), but this can be switched off (unbinned = false).

double getYMean()  
Return the average value in Y, calculated as the sum of weights inside the histogram range divided by the number of times it was filled.

double getXRMN(int n=2, bool unbinned=true)  
double getXRMNErr(int n=2, bool unbinned=true)  
Return the n'th root of the n'th moment about the mean, < (x - <x>)^n >, and its estimated statistical error. The default value n = 2 corresponds to the standard root-mean-square width. Up to n = 6, both unbinned and binned moments can be calculated. For n >= 7, and for all error estimates, only binned calculations are available. Note that (unlike ROOT), the error estimates do not assume normal distributions. If unbinned = false an estimate of the bin granularity error is added in quadrature to the statistical error (up to n = 6, beyond which the bin granularity error is not estimated).

double getXRMS(bool unbinned=true)  
double getXRMSErr(bool unbinned=true)  
Wrapper of getXRMN() to return the root-mean-square width (aka the root of the second central moment, aka the root-mean-square deviation about the mean), and its error. If unbinned = false an estimate of the bin granularity error is added in quadrature to the statistical error.

double Hist::getBinContent(int iBin)  
return the value in bin iBin, ranging from 1 through numberOfBins, with 0 for underflow and numberOfBins + 1 for overflow.

double Hist::getBinEdge(int iBin)  
return the lower edge of bin iBin, ranging from 1 through numberOfBins.

double Hist::getBinWidth(int iBin)  
return the width of bin iBin, ranging from 1 through numberOfBins.

vector<double> Hist::getBinContents()  
return the bin contents.

vector<double> Hist::getBinEdges()  
return the bin edges.

int Hist::getEntries(bool alsoNonFinite = true)  
return the number of entries, i.e. the number of time that fill(...) has been called. If alsoNonFinite is false, then entries that are infinity or NaN are ignored.

double Hist::getWeightSum(bool alsoOverUnder = true)  
return the sum of weights with which the histogram was filled. The output of this method only differs from that of getEntries() above if non-unity weights were used when the histogram was filled. If alsoOverUnder is true, then the overflow and underflow bins are included in the weight sum, otherwise not.

double Hist::getNEffective()  
Return effective entries (for weighted histograms = number of equivalent unweighted events for same statistical power), calculated as (sum(w))^2 / sum(w^2).

bool Hist::sameSize(const Hist& h)  
checks that the number of bins and upper and lower limits are the same as in the histogram in the argument.

void Hist::takeFunc(function<double(double)> func)  
pass an arbitrary function that takes as an argument a double and returns a double. This function is applied to the current contents bin by bin.

void Hist::takeLog(bool tenLog = true)  
by default take 10-logarithm of current contents bin by bin. With optional argument false instead take e-logarithm of contents bin by bin. If to be used, then right before the histogram is output.

void Hist::takeSqrt()  
take square root of current contents bin by bin, with negative contents set to zero.

void Hist::normalize( double f = 1., bool overflow = true)  
Rescale the histogram to a given sum of all bin contents, where
argument sum (default = 1.) : is the intended sum of all bin contents after rescaling, and
argument overflow (default = on) : includes the underflow and overflow bins in the definition of the original area to be rescaled, or not. Either way, the under/overflow are always rescaled by the same factor as the histogram proper.

void Hist::normalizeIntegral( double f = 1., bool alsoOverflow = true)  
Rescale the histogram so that it integrates to the value given by
argument f (default = 1.) : . This is the appropriate normalization if the histogram represents a probability distribution. The argument
argument overflow (default = on) : specifies whether the underflow and overflow bins in the definition of the original area to be rescaled, or not. Either way, the under/overflow are always rescaled by the same factor, except the bin width, as the histogram proper. Bin widths are handled for both linear and log plots.

void Hist::normalizeSpectrum( double wtSum)  
Rescale the histogram by a factor 1/(wtSum * dx) where dx is the bin width. This is the appropriate renormalization if the histogram represents a spectrum on the form (1/N_{ev}) dN/dx.
argument wtSum (default = 1.) : is the total weight of the events, which if events are unweighted is just the total number of successfully generated events. The under/overflow are always rescaled by the same factor, except the bin width, as the histogram proper. Bin widths are handled for both linear and log plots.

Hist& Hist::operator+=(const Hist& h)  
Hist& Hist::operator-=(const Hist& h)  
adds or subtracts the current histogram by the contents of the histogram in the argument if sameSize(...) is true, else does nothing.

Hist& Hist::operator*=(const Hist& h)  
Hist& Hist::operator/=(const Hist& h)  
multiplies or divides the current histogram by the contents of the histogram in the argument if sameSize(...) is true, else does nothing.

Hist& Hist::operator+=(double f)  
Hist& Hist::operator-=(double f)  
adds or subtracts each bin content by the common offset f.

Hist& Hist::operator*=(double f)  
Hist& Hist::operator*=(double f)  
multiplies or divides each bin content by the common factor f.

friend Hist operator+(double f, const Hist& h1)  
friend Hist operator+(const Hist& h1, double f)  
friend Hist operator+(const Hist& h1, const Hist h2)  
add a constant to a histogram or two histograms to each other, bin by bin.

friend Hist operator-(double f, const Hist& h1)  
friend Hist operator-(const Hist& h1, double f)  
friend Hist operator-(const Hist& h1, const Hist h2)  
subtract a histogram from a constant, a constant from a histogram, or two histograms from each other, bin by bin.

friend Hist operator*(double f, const Hist& h1)  
friend Hist operator*(const Hist& h1, double f)  
friend Hist operator*(const Hist& h1, const Hist h2)  
multiply a constant by a histogram or two histograms by each other, bin by bin.

friend Hist operator/(double f, const Hist& h1)  
friend Hist operator/(const Hist& h1, double f)  
friend Hist operator/(const Hist& h1, const Hist h2)  
divide a constant by a histogram, a histogram by a constant, or two histograms by each other, bin by bin.

HistPlot::HistPlot(string pythonName, bool useLegacy = true)  
create a file to which successively Python commands can be written.
argument pythonName : the name of the Python code file will become pythonName.py. When later executed with python pythonName.py the frames defined below will be drawn and output.
argument useLegacy : A parameter necessary for compatibility with older versions of Matplotlib. For Matplotlib version 3.2 and earlier, this argument must be false; for version 3.5 or later, it must be true.

void HistPlot::frame( string frameName, string title = "", string xLabel = "", string yLabel = "", double xSize = 8., double ySize = 6.)  
command to open a frame where successive histograms can be added and plotted.
argument frameName : the name of the frame. When the Python program is run, the output PDF file will be frameName.pdf. If this is blank or it is the same name as the previous frame, then the frame will be appended to the most recent previous PDF file. It cannot be left blank for the first call.
argument title : an optional title of the histogram, plotted on top of the frame.
argument xLabel : an optional label plotted below the x axis.
argument yLabel : an optional label plotted to the right of the y axis.
argument xSize, ySize : specifies the x and y size, in inches, of the canvas (the full figure area). Default values agree with the Pyplot defaults.
Note: the title and labels may contain special LaTeX symbols, e.g. simple math formulae enclosed by dollar signs $...$, that are interpreted by the Pyplot package. A backslash \ needs to be doubled to \\, however, since else it will be intepreted as an escape sequence.

void HistPlot::add(const Hist& hist, string style = "h", string legend = "void")  
add a histogram to the current plot frame, where
argument hist : is the name of the histogram to be plotted,
argument style : tells how the histogram data will be represented in the figure, where default "h" is histogram style, "-" full lines, "--" dashed ones, "-." dash-dotted ones, "." points, "o" circles, "+" and "x" crosses, "*" stars, and so on as described in the Pyplot manual; alternatively to "h" the argument "e" can be specified which will include error bars on the histogram; additionally a colour may also be specified as described earlier, separated by a comma from the symbol; either of both can be used, e.g. "h,r", ",magenta", "+,", "--" or just ""; and
argument legend : is the text written in association with the plotting symbol, where the default void is replaced by the title provided when the histogram was booked, and LaTex input may be used as described in the note above.

void HistPlot::addFile(string file, string style = "o", string legend = "void", string xyerr = "")  
add data in an already existing file to the current plot frame, where
argument file : is the name of the file to be plotted,
argument style : tells how the data will be represented in the figure, just like in the add method above, but by default intended for discrete points rather than histograms or connected lines;
argument legend : is the text written in association with the plotting symbol, where the default void is replaced by the name of the input file, and LaTex input may be used as described in the note above for frame; and
argument xyerr : can be used to provide error bars for a point, where the options are to give an x to plot a symmetric error bar in the x direction, an X for an asymmetric ditto, a y for a symmetric error bar in the y direction, and a Y for an asymmetric ditto, and where both x and y can be combined or omitted as desired.
Note: this method is primarily intended to overlay data points over existing histograms, and not for standalone usage. The data file is expected to contain at least two columns, where the first two give the (x,y) coordinates of the points. Then up to four error columns can follow, first the ones pertaining to the x error, if any, and then to y. Symmetric errors require one column with length in both directions, whereas asymetric gives first the length downwards and then the length upwards, both as non-negative numbers. For example, to plot the point (1.0, 2.0) with a symmetric error +/-0.5 in the x-direction and an asymmetric error +0.2/-0.5 in the y-direction, the file should contain the row 1.0 2.0 0.5 0.2 -0.5 and the value of xyerr might be "xY".

void HistPlot::plot( bool logY = false, bool logX = false)  
write the Pyplot commands for the current frame, as defined by the input from the two methods above, and additionally write the data files needed when the Python code is to be run.
argument logY : By default a linear y scale is used, but when true the y scale is instead logarithmic, except that a linear scale is used the very last step to zero bin content.
argument logX : By default a linear or logarithmic x scale is used depending on how the histograms were booked, i.e. by default linear, and this should not be overridden. For the case that the only input comes from addFile commands, however, this option gives the possibility to pick a logarithmic x scale instead of the default linear one. (Presumably you would then use the plotvariant below.) Note the confusing order of the two arguments, which is caused by the first presumably being much more commonly used than the second.

void HistPlot::plot( double xMin, double xMax, double yMin, double yMax, bool logY = false, bool logX = false)  
a variant of the plot method above, where the x and y ranges of the plot frame can be set by hand instead of calculated from the histogram information.

void HistPlot::plotFrame( string frameName, const Hist& hist, string title = "", string xLabel = "", string yLabel = "", string style = "h", string legend = "void", bool logY = false)  
is an omnibus version combining the three above methods when only a single histogram is to be plotted in a frame. The arguments and their meaning is the same as already explained for them, and the order in which they appear is almost the same, the exception being that the two compulsory arguments frameName and hist have been put at the beginning.