METAL 2010
EDITED BY: 
PAUL MARDIKIAN 
CLAUDIA CHEMELLO
CHRISTOPHER WATTERS
PETER HULL
ISBN 978-0-9830399-0-7 
© 2010 Clemson University. All rights reserved.
Photograph: Detail from ‘Bridge No. 2’ from the series Rust Never Sleeps, John Moore, 1996
INTERNATIONAL CONFERENCE ON METAL CONSERVATION
INTERIM MEETING OF THE INTERNATIONAL COUNCIL  
OF MUSEUMS COMMITTEE FOR CONSERVATION  
METAL WORKING GROUP
OCTOBER 11-15, 2010
CHARLESTON, SOUTH CAROLINA, USA
178       M E TA L  2 0 1 0  :  C H A R L E S T O N ,  S O U T H  C A R O L I N A ,  U S A
been dealt with via formal or informal inter-laboratory 
analyses in which common reference materials are 
measured (e.g., (Glascock 1999, Hein 2002)). Since W. 
T. Chase’s signal 1974 paper, ‘Comparative analysis of 
archaeological bronzes’ (Chase 1974), we know of only 
one other published study that has attempted to evaluate 
the inter-laboratory reproducibility of quantitative XRF 
on historic copper alloys (Northover and Rychner 1998). 
Neither of these publications focused primarily on XRF, 
but rather on reproducibility between techniques.  Both 
publications also focused on copper alloys where the 
primary alloying metals were tin and lead.
Building on an earlier workshop and XRF round-
robin organized by the Getty Conservation Institute, 
the National Gallery of Art in Washington hosted a 
Introduction 
Since at least the late 1950s, a number of papers have 
been published that report quantitative analyses of 
historic copper alloys based on X-ray Fluorescence 
Spectroscopy (XRF).  With the recent widespread 
introduction and adoption of relatively low-cost, 
portable XRF spectrometers, the pace of publication 
of such data is increasing and is likely to accelerate 
further. Although we welcome these advances, the rapid 
proliferation and publication of XRF data raises a host 
of important questions concerning the accuracy and 
inter-laboratory comparability and reproducibility of 
published data. While within laboratory conclusions 
based on quantitative XRF analysis may be interesting 
and instructive, comparing data between laboratories, or 
even between different instruments within a laboratory, 
can be problematic.  Traditionally, such issues have 
AN EVALUATION OF INTER-LABORATORY REPRODUCIBILITY FOR 
QUANTITATIVE XRF OF HISTORIC COPPER ALLOYS 
Arlen Heginbotham1*, Aniko Bezur2, Michel Bouchard3, Jeffrey M. Davis4, Katherine Eremin5, 
James H. Frantz6, Lisha Glinsman7, Lee-Ann Hayek8, Duncan Hook9, Vicky Kantarelou10, Andreas 
Germanos Karydas10, 11, Lynn Lee5, Jennifer Mass12, Catherine Matsen12, Blythe McCarthy13, Molly 
McGath13, Aaron Shugar14, Jane Sirois15, Dylan Smith7, Robert J. Speakman16
1 Decorative Arts and Sculpture Conservation Department 
J. Paul Getty Museum 
1200 Getty Center Drive, Suite 1000 
Los Angeles, CA 90049-1687 
USA
2 The Museum of Fine Arts, Houston 
3 The Getty Conservation Institute 
4 The National Institute of Standards and Technology 
5 Harvard Art Museum 
6 The Metropolitan Museum of Art 
7 National Gallery of Art, Washington, D.C. 
8 The Smithsonian Institution 
Abstract 
This paper reports the results of a study conducted to evaluate the current state of inter-laboratory reproducibility 
when conducting quantitative XRF analysis of historic copper alloys. Fourteen institutions, primarily from the museum 
community, participated in the study, using a total of 19 X-ray fluorescence instruments. The design of the study 
was based largely on ASTM standard E1601, Standard Practice for Conducting an Interlaboratory Study to Evaluate 
the Performance of an Analytical Method. In addition to addressing overall inter-laboratory reproducibility, we also 
attempt to evaluate the accuracy of individual laboratories. By determining correlations between accurate results and 
experimental methods and procedures, we are able to propose recommendations regarding best practice and ways in 
which reproducibility might be improved.
Keywords: inter-laboratory reproducibility, X-ray fluorescence, copper alloys, fundamental parameters, ASTM E1601
9 The British Museum 
10 NCSR “Demokritos” – Institute of Nuclear Physics – Athens, 
Greece 
11 Nuclear Spectrometry and Applications Laboratory, 
International Atomic Energy Agency (IAEA) – Vienna, Austria
12 The Winterthur Museum/ University of Delaware 
13 The Freer Gallery of Art and Arthur M. Sackler Gallery 
14 Buffalo State College 
15 The Canadian Conservation Institute 
16 Museum Conservation Institute, Smithsonian Institution 
* Corresponding author: aheginbotham@getty.edu 
A N  E V A L U AT I O N  O F  I N T E R - L A B O R AT O R Y  R E P R O D U C I B I L I T Y        179
proprietary software for quantification, many others 
used or created customized solutions, ranging from 
spreadsheet-based analysis, to complete programs 
written in-house, to the use of X-ray analysis software 
available on the Internet.
The design of the study was based largely on ASTM 
standard E1601, Standard Practice for Conducting an 
Interlaboratory Study to Evaluate the Performance of 
an Analytical Method, following Test Plan A.  Each 
participating laboratory was asked to analyze a set of 
12 samples of metal (designated A-L). The same sample 
set was circulated to each participant via a traceable 
shipper over the course of eight months. The test samples 
consisted of three types: 1- cuttings obtained from 
reference materials[2] (RMs) n=4; 2- pieces of historic 
metal, n=6; 3- small ingots prepared by the lead author, 
n=2. The range of elemental compositions included in 
these samples was tailored to imitate the broad range 
can be found in historic copper alloy artifacts from the 
Bronze Age through the 19th century.  A table presented 
in the Results section below provides brief descriptions 
of the 12 samples, their approximate compositions, and 
the range of concentrations determined for each element.
On each sample, a circular site was selected for analysis 
with a diameter of approximately 9 mm.   These sites 
were first flattened with 220-grit silicon carbide abrasive 
paper.  They were then polished with successively finer 
grades of Micro-mesh™ abrasive cloths, finishing with 
4,000-grit.  All polishing was done wet in ethanol, and 
fresh abrasive was used for each sample.  The sites 
designated for analysis were clearly circumscribed on 
each sample with a stylus, ensuring that the material 
analyzed would be the same across laboratories.  The 
samples were individually bagged and placed in a 
padded case for transport. Participants were asked not to 
touch or otherwise disturb the sample sites.  
Per ASTM standard E1601, each laboratory was asked 
to conduct triplicate analyses of each area on each 
sample according to their standard in-house procedures.  
Participants were asked to conduct analyses that would 
yield a result representative of the entire area.  In 
addition, the three measurements were to be acquired 
in immediate succession with as little variation in 
procedure as possible.
Data Recording and Accumulation  
for Analysis  
Each participating laboratory completed a standardized 
reporting form in spreadsheet format for each instrument 
used.  If the same instrument was used in conjunction 
with more than one quantification method, a separate 
form was completed for each method employed.  For 
every analysis, participants were asked to report on a 
minimum of 12 elements.  These elements were Mn, Fe, 
Ni, Cu, Zn, As, Ag, Cd, Sn, Sb, Pb and Bi.  Space was 
provided to report on additional elements if they were 
detected.  If a quantitative result for a requested element 
could not be obtained, analysts were asked to choose 
from the following responses: BDL – below detection 
limit/not detected; Trace – element present in a small 
amount but not quantifiable; Present – element present in 
a significant amount but not quantifiable; N/A – element 
seminal meeting in 2007 of representatives from seven 
museums to address issues surrounding the sharing 
and comparability of quantitative XRF data between 
institutions. That meeting, sponsored by Robert H. Smith 
and the Center for Advanced Study in the Visual Arts, 
focused on these issues particularly as they relate to the 
analysis of Renaissance bronze sculpture.  Moderated 
by then Senior Curator of Sculpture, Nicholas Penny 
and Head of the Object Conservation Department, 
Shelley Sturman, the participants agreed that the ability 
to compare data would be valuable, but enumerated a 
host of problems and obstacles to be overcome before 
meaningful inter-laboratory comparisons could be made. 
This study is a direct product of that encounter.
The program described here is an attempt to evaluate 
the current state of inter-laboratory reproducibility 
of quantitative XRF analysis of copper alloys.  We 
conducted, interpreted and summarized data generated 
from a carefully designed study informed by ASTM 
standard methodology (ASTM 2006, ASTM 2003). By 
quantifying the extent of reproducibility, we hope to 
provide valuable quantitative guidelines for practitioners 
who might wish to compare their own quantitative data 
with that generated by other laboratories, or who might 
wish to pursue meta-studies based on the work of many 
laboratories.
Our study sought participation primarily from 
laboratories in the museum community whose interests 
include a focus on historic copper alloys. In addition, 
we sought to include a variety of instrument types, 
supported by a variety of quantification procedures and 
software.  
In addition to addressing overall inter-laboratory 
reproducibility, we also attempt to evaluate the 
accuracy of individual laboratories.  By determining 
correlations between accurate results and experimental 
methods and procedures, we are able to propose some 
recommendations regarding best practice.
Methods
Research Design 
Seventeen institutions agreed to participate in the study.  
Of these, many hoped to produce multiple data sets by 
using more than one instrument or by processing data 
from one instrument using multiple methods. Therefore, 
the maximum number of data sets anticipated was 30.  
In order to maintain anonymity throughout the study, 
each institution was assigned a laboratory number for 
each anticipated data set.  This number was known only 
to the members of that institution and to the program 
coordinator (Heginbotham).  Fourteen, or 82%, of the 
institutions turned in complete results and the total 
number of data sets included in the study is 19. In one 
case, the same instrument was used to produce three 
data sets by processing the same raw spectra using three 
different methods[1].
Eight instruments were used in the study.  These include 
Bruker/Keymaster Tracer, Bruker/Roentec Artax, EDAX 
Eagle 3, Elva-X light, Innov-X XT-260, Niton Gold, 
Spectrace Omega 5, and laboratory-built models.  While 
many laboratories chose to use the manufacturer’s 
180       M E TA L  2 0 1 0  :  C H A R L E S T O N ,  S O U T H  C A R O L I N A ,  U S A
Once errors were corrected, reproducibility statistics 
were calculated for each of the 12 requested elements in 
each sample. The reproducibility index (R) is a measure 
of precision and represents the expected variability of 
results when a method is used in different laboratories.  
Specifically:
Use R to predict how well your results should agree 
with those from another laboratory: First, obtain a 
result…, then add R to, and subtract R from, this 
result to form a concentration confidence interval. 
Such an interval has a 95% probability of including 
a result obtainable by the method should another 
laboratory analyze the same sample. For example, a 
result of 46.57% was obtained. If R for the method 
at about 45% is 0.543, the 95% confidence interval 
for the result (that is, one expected to include the 
result obtained in another laboratory 19 times out of 
20) extends from 46.03 to 47.11% (ASTM 2003). 
The reproducibility index was calculated as:
R = 2.8{(s
x
 )2 + [(Σ(s2) / p) (n-1) / n]}½ 
where s = the standard deviation of each laboratory’s 
replicate measurements and n = the number of replicates 
(in this case, three[3]). 
Finally, the percent relative reproducibility index (Rrel%), 
which represents R as a percentage of the overall mean, 
was calculated according to the formula:
Rrel% = 100R /X
Lower Limits 
A lower limit (L) was calculated for each element (with 
the exception of copper) below which the method is 
not considered reliable.  This calculation was made 
according to the formula 
L = 100R / emax
where R = element reproducibility index determined for 
the sample with the lowest concentration of the specific 
element, and emax = maximum acceptable percent 
relative error.  In this case, emax was set to 50% based on 
ASTM guidelines.
Accuracy of Overall Median  
It was hypothesized that the overall group medianχ 
would likely be a good approximation of the true 
concentration of each element in a sample. If true, 
thenχ values could be used to gauge the accuracy of 
individual laboratories for samples A-H. In order to verify 
this hypothesis, the accuracy ofχ values was evaluated 
for the four RMs (samples I-L). For each certified value 
(X) in the RMs, the percent error of the median was 
calculated:
% error = 100(χ - X) / X
Certified values that fell below the method’s calculated 
lower limit (L) for that specific element were not 
considered in evaluating accuracy.  The mean percentage 
error for all elements in the RMs was calculated using the 
absolute values of all percentage errors where X > L.
not analyzed for/not detectable by this instrument. Data 
for each sample from all laboratories were compiled in a 
master database for evaluation.  
The reporting form provided to the participants also 
requested extensive detail about the instrument, 
software, and procedures utilized in each laboratory. 
Participants were asked to provide information about 
their instrument manufacturer, model, anode material, 
and detector type.  Participants also reported on 
operating parameters, including voltage (kV), current 
(mA), measurement time, spot size, filters, typical 
number of live (valid) counts collected by the detector 
per second, and average dead time.  Participants 
also reported the software and methodology used for 
quantification. This included the full name and version 
of software, the type of method used, the number of 
standards used, and the frequency of calibration checks 
and recalibration. In addition, participants were asked to 
report their errors and detection limits for each of the 12 
elements listed above, and to specify how these values 
were determined.
Assessment Methods
Reproducibility Statistics 
In general, our evaluation followed the guidelines 
presented in the ASTM E1601 (test plan A).  For each 
set of triplicate results for a particular element in an 
individual sample, the mean result (x ) was calculated. 
The overall group mean (X ) was then calculated as 
X  =  (Σ  x) /  p
where p = the number of laboratories reporting a 
quantitative result for that element.  For eachx, the 
laboratory difference (d) was calculated as
d = x - X
The standard deviation of all laboratory differences (s
x 
) 
was then calculated for each element in each sample as
s
x = [Σ(d
2) / (p-1)]½ 
These preliminary calculations allowed the calculation 
of a between-laboratory consistency statistic, designated 
as h, that provides a normalized measure of the 
difference between the reported result and the overall 
mean value of all laboratories’ results for the same 
element and standard:  
h = d / s
x
Comparison of the h statistics to a table of critical 
values allowed outlying results, that is, results that 
deviated significantly from the overall group mean, to be 
identified and flagged for follow-up.  Laboratories with 
flagged results were contacted and asked to check their 
records to see if any errors in procedure, analysis, or 
transcription of results could be identified.  If any such 
errors were identified, the data were corrected, but if no 
errors were found, the data were retained as originally 
reported.  Of 1,718 h statistics that were calculated, 
48 (2.8%) were flagged as identifying outliers and 20 
corrections were made by four laboratories.
A N  E V A L U AT I O N  O F  I N T E R - L A B O R AT O R Y  R E P R O D U C I B I L I T Y        181
Mn and Cd were sporadically reported by only a few 
laboratories, making any meaningful comparisons or 
calculation impossible. Consequently, discussion of these 
elements is omitted. 
Correlations Between Accuracy Scores  
and Methods 
Accuracy scores were compared with the descriptions 
of instrument specifications, operating parameters and 
methodology provided in the laboratories’ reporting 
forms. In an attempt to identify ‘best practices’, we 
sought to identify characteristics that were common to 
the most accurate laboratories.  No attempts were made 
to be quantitative in this assessment. Rather, general 
correlations were identified by simple graphical plotting 
of the data.
Results
Table 1 provides a summary of laboratory data collected 
in the reporting forms. Table 2 gives brief descriptions 
of the 12 samples along with their approximate 
compositions, and the range of concentrations covered 
by the set as a whole. For samples A-H, the values are 
based on the overall group median; for samples I-J, 
values are as listed by the manufacturer of the RM.  
Lower limits for samples A-H were defined as described 
in METHODS. The complete quantitative data reported 
by all laboratories is available at the following address:  
http://www.getty.edu/museum/conservation/papers.html
Reproducibility Statistics and Lower Limits 
Summary statistics as per ASTM for the eight most 
commonly identified elements are presented as a group 
in Table 3.  For each element, the samples, or test 
materials, are sorted by overall mean weight percent.  
The method’s lower limit (L) for each element is shown 
on the right side of the relevant sub-table except in the 
case of copper, for which no lower limit was calculated.  
A dashed line through the center of each sub-table 
separates materials whose overall mean concentration 
falls below L (above the line) from those where the 
mean is greater than L (below the line).   The latter 
group constitutes the samples for which the method is 
considered valid.  The mean value of the Rrel% statistics for 
these samples is shown at the bottom right of each sub-
table.  This statistic provides the most succinct summary, 
for each element, of the analytical reproducibility 
that may be currently anticipated within this group of 
laboratories, based on a 95% confidence interval.
Evaluation of Accuracy 
Data for the four RMs are presented in Table 4. This table 
shows the group’s overall median (χ ) for all elements 
where reference or certified values are given. Percent 
errors are shown for elements where χ > L. The results 
show that, on average, χ falls within 5% of the certified 
value in cases where χ lies in the range of validity for 
the method. It was determined thatχ, if greater than 
L, could be used as a reasonable approximation of the 
true value for the purposes of evaluating the accuracy of 
individual laboratories.
Ranking of Laboratories 
The accuracy of each laboratory/instrument combination 
was evaluated on an element-by-element basis. For 
each quantitative result from a given laboratory, the 
laboratory difference from the assumed ‘true’ value (dt) 
was calculated. For the four RMs (samples I-L), this was 
calculated as 
dt = x – X
(recall that x = the laboratory’s mean result and X = the 
certified value). For the non-reference samples (A-H) dt 
was calculated as 
dt = x - Xm
where Xm = the median value of all laboratory results. 
If Xm < L (the method’s lower limit as defined above), 
then Xm was considered to be unreliable as a measure 
of the true value; therefore no d values were calculated 
and the element was not used for ranking purposes.  
As an added precaution, if fewer than 10 laboratories 
reported data for an element in a given standard, no d 
values were calculated and the element was not used for 
ranking purposes.
A normalized accuracy statistic (ha) was then calculated 
by dividing the laboratory difference by the standard 
deviation of laboratory differences. 
ha = dt /(Σ(dt
2) / (p-1))½
where p = the number of laboratories reporting a 
quantitative results for the element in the given sample.
For each laboratory, all ha values for a given element 
were combined to generate a mean accuracy score 
(Selement) for that element according to the formula
Selement = Σ(ha
2) / n
where n = the number of quantitative results reported 
for the given element for all 12 samples.  Scores close 
to zero reflect results that are consistently close to the 
assumed true value[4].
All 19 laboratories reported quantitative results for 
Cu, Zn, Sn and Pb (hereafter referred to as the ‘major 
elements’).  An aggregate score for major elements (Smajor) 
was calculated:
Smajor =  SCu + SZn + SSn + SPb
Only 15 laboratories reported quantitative results for all 
four of the elements Fe, Ni, As and Sb  (hereafter referred 
to as the ‘minor elements’). An aggregate score for minor 
elements (Sminor) was calculated for these laboratories:
Sminor =  SFe + SNi + SAs + SSb
Only eight laboratories reported quantitative results for 
Bi, so SBi was not included in the calculation of Sminor.  
SAg also was rejected for inclusion in Sminor because the 
reproducibility of results for Ag was so poor that the 
median results (Xm) were not considered to be valid 
indicators of the true value.
182       M E TA L  2 0 1 0  :  C H A R L E S T O N ,  S O U T H  C A R O L I N A ,  U S A
Laboratory 
Number
Tube 
target
Detector 
Type
kV mA
Acqusition 
Time (s)
Spot 
size 
(mm)
Filters 
(element)
Counting 
rate (cps)
Quantification 
method
Number of 
Standards 
1 Rh PIN 40 2.5 90 6 Al Ti Cu 3500 Empirical 27
2 W PIN 45 7.5 100 8 Ni Al 4800 FP 0
3 Re PIN 40 1 400 6 Al Ti 5000 FP w/stds 29
6 Rh SDD 50 0.6 600 0.05 Ti Co Pd 8000 FP w/stds 19
7 Re PIN 40 1 400 6 Al Ti 6000 FP 0
8 Mo SDD 50 0.8 300 0.9 None 30000 FP w/stds 4
9 Rh Si-Li 40 0.1-0.3 300 0.054 None 10000 FP 0
10 Rh Si-Li 45 1 100 8.5 Rh 6800 FP w/stds 8
12 Mo SDD 50 0.6 150 0.07 none 4800 Empirical 12
13 Au SDD 40 40 400 8 Ag 95000 FP w/stds 8
14 Rh PIN 40 1.4 120 6 Al Ti 6000 Empirical 73
15 Rh PIN 40 1.8 60 6 Al Ti 6300 Empirical 45
18 Rh PIN 40 1.8 180 6 Al Ti 7300 Empirical 46
19 Rh PIN 40 0.1 600 2.6 Ni V 700 FP w/stds 19
22 Re PIN 40 1.5 400 6 Al 6500 Empirical 36
23 W SDD 50 0.2 200 1.5 Ni 16000 Empirical 5
24 Ag PIN 35 6 60 10 Al 5000 FP w/stds 5
27 Rh PIN 40 0.003 300 5 Al Ti Cu 6250 Empirical 125
28 Rh SDD 50 0.35 200 1.5 None 60000 Empirical 15
Table 1. Summary of laboratory data. 
Sample: A B C D E F G H I J K L
D
e
s
c
ri
p
ti
o
n
C
h
in
e
s
e
 C
o
in
 
(u
n
k
n
o
w
n
 d
a
te
)
It
a
li
a
n
 U
p
h
o
ls
te
ry
 
T
a
c
k
 (
1
7
th
 C
e
n
tu
ry
?
)
B
ri
ti
s
h
 D
o
o
r 
K
n
o
b
 
(1
8
th
 C
e
n
tu
ry
?
)
A
m
e
ri
c
a
n
 
S
c
re
w
d
ri
v
e
r 
F
e
rr
u
le
 
(1
9
th
 C
e
n
tu
ry
)
B
ri
ti
s
h
 A
u
g
e
r 
C
o
v
e
r 
P
la
te
 (
1
9
th
 C
e
n
tu
ry
)
L
a
b
o
ra
to
ry
-c
a
s
t 
In
g
o
t
L
a
b
o
ra
to
ry
-c
a
s
t 
In
g
o
t
D
u
tc
h
 E
a
s
t 
In
d
ia
 
C
o
m
p
a
n
y
 C
o
in
 
(1
7
5
4
)
B
ra
m
m
e
r 
C
9
3
4
 (
R
M
)
C
IT
F
 B
3
2
 (
C
R
M
)
M
B
H
 3
1
X
 B
2
7
 A
 
(C
R
M
)
B
N
F
 C
7
1
.3
4
-3
 (
C
R
M
)
M
in
im
u
m
M
a
x
im
u
m
Fe 0.55 0.22 <0.17 <0.17 0.41 0.82 0.41 <0.17 0.01 0.1 0.31 0.29 0.01 0.82
Ni <0.35 0.35 <0.35 <0.35 <0.35 0.96 1.1 <0.35 0.49 1.49 0.042 - 0.04 1.5
Cu 71 82 75 85 70 53 72 98 82.64 74.85 78.2 87.23 53 98
Zn <0.79 9.3 22 3.6 28 34 3.0 <0.79 0.17 1.15 19.9 1.55 0.17 34
As 0.47 <0.25 <0.25 <0.25 0.29 2.52 0.93 0.25 - 0.0056 0.03 0.18 0.01 2.5
Ag <0.15 <0.15 <0.15 <0.15 <0.15 <0.15 0.17 <0.15 - - - 0.025 0.03 0.17
Sn 4.1 4.6 <0.27 8.5 0.53 2.8 16 <0.27 8.1 5.9 0.92 8.2 0.53 16
Sb 0.22 0.12 <0.12 0.13 <0.12 3.0 1.9 0.87 0.14 0.13 0.04 0.071 0.04 3.0
Pb 24 3.3 1.9 2.3 <1.22 1.4 3.9 <1.22 8.45 16.1 0.24 2.47 0.24 24
Bi <0.12 <0.12 <0.12 <0.12 <0.12 0.32 0.18 <0.12 - - 0.055 0.029 0.03 0.32
Table 2. Compositions and descriptions of the 12 samples (A-L) used in the study.  For samples A-H, values are based on the overall group median; for 
samples I-J, values are as certified by the manufacturer.  Lower limits for samples A-H were defined as described in METHODS.
A N  E V A L U AT I O N  O F  I N T E R - L A B O R AT O R Y  R E P R O D U C I B I L I T Y        183
parameter (FP); fundamental parameter calibrated 
with standards (FP w/standards); and algorithms using 
empirical coefficients (empirical).
FP methods are based on mathematical models that 
predict the intensity of fluorescent radiation from a 
sample of known composition.  The models incorporate 
knowledge of many instrument parameters, such as 
incidence and take-off angles (for both anode and 
sample), anode material, detector area and thickness, 
voltage, attenuators (such as windows, filters and air 
path), etc.  FP models generally account for matrix 
effects, such as absorption and secondary fluorescence 
(in which some portion of the characteristic photons 
Ranking of Laboratories 
Laboratories Smajor and Sminor scores are shown in Table 5, 
ranked in order of highest to lowest accuracy. 
Correlations with Performance
Quantification Method 
Clearly, the strongest correlation between laboratory 
characteristics and accuracy was based on the type of 
method employed to convert raw elemental intensities 
into a quantitative result (see Figures 1a, 1b and Table 5). 
Three major categories of method were reported by the 
participating laboratories: standardless fundamental 
Iron - Statistical Summary Arsenic - Statistical Summary
Test Material
Number of 
laboratories 
(n)
Overall Mean 
( X)
Reproducibility 
Index (R)
Percent Relative 
Reproducibility 
Index (Rrel%)
Test Material
Number of 
laboratories 
(n)
Overall Mean 
( X)
Reproducibility 
Index (R)
Percent Relative 
Reproducibility 
Index (Rrel%)
I 7 0.023 0.083 353 K 10 0.041 0.074 179
H 11 0.029 0.062 216 I 5 0.054 0.142 262
J 17 0.126 0.192 153 J 4 0.067 0.280 417
C 18 0.135 0.156 115 D 11 0.144 0.573 399
D 19 0.151 0.213 141 C 14 0.146 0.231 159
B 19 0.236 0.303 128 B 7 0.176 0.764 435
L 19 0.283 0.356 126 L 13 0.213 0.746 350
K 19 0.363 0.412 113 H 16 0.247 0.249 101
E 19 0.420 0.459 109 Mean Rrel% E 16 0.291 0.337 116 Mean Rrel%
G 18 0.427 0.569 133 for X>L A 10 0.444 0.489 110 for X>L
A 19 0.592 0.696 118 G 15 0.908 0.873 96
F 19 0.902 1.101 122 F 16 2.558 3.261 127
Nickel - Statistical Summary Tin - Statistical Summary
Test Material
Number of 
laboratories 
(n)
Overall Mean 
( X)
Reproducibility 
Index (R)
Percent Relative 
Reproducibility 
Index (Rrel%)
Test Material
Number of 
laboratories 
(n)
Overall Mean 
( X)
Reproducibility 
Index (R)
Percent Relative 
Reproducibility 
Index (Rrel%)
K 12 0.074 0.177 238 H 5 0.053 0.136 255
C 10 0.082 0.240 292 C 13 0.112 0.132 118
E 15 0.092 0.174 189 E 19 0.529 0.235 44
D 7 0.100 0.281 281 K 19 0.866 0.315 36
A 12 0.145 0.211 146 F 18 3.092 2.176 70
L 4 0.164 0.584 356 A 18 4.320 1.660 38
H 14 0.197 0.267 136 B 19 4.687 1.295 28
B 18 0.378 0.273 72 J 19 5.951 2.330 39
I 17 0.462 0.242 52 Mean Rrel% D 19 8.543 1.804 21 Mean Rrel%
G 17 1.040 0.582 56 for X>L I 19 8.554 2.225 26 for X>L
F 18 1.066 0.732 69 L 19 8.608 3.953 46
J 18 1.475 0.820 56 G 18 17.166 9.082 53
Copper - Statistical Summary Antimony - Statistical Summary
Test Material
Number of 
laboratories 
(n)
Overall Mean 
( X)
Reproducibility 
Index (R)
Percent Relative 
Reproducibility 
Index (Rrel%)
Test Material
Number of 
laboratories 
(n)
Overall Mean 
( X)
Reproducibility 
Index (R)
Percent Relative 
Reproducibility 
Index (Rrel%)
F 19 53.249 10.289 19 E 5 0.026 0.064 247
E 19 69.855 3.031 4 C 6 0.029 0.064 220
A 19 70.508 13.163 19 K 8 0.030 0.060 199
G 18 71.481 8.980 13 I 13 0.126 0.252 200
J 19 73.935 7.083 10 B 13 0.156 0.400 257
C 19 75.236 3.123 4 D 12 0.157 0.259 165
K 19 78.234 2.590 3 J 13 0.171 0.418 244
I 19 81.758 2.701 3 L 11 0.177 0.932 528
B 19 81.795 3.930 5 Mean Rrel% A 14 0.211 0.277 131 Mean Rrel%
D 19 85.266 2.493 3 for X>L H 16 0.882 0.473 54 for X>L
L 19 86.209 6.725 8 G 16 1.857 0.738 40
H 19 98.130 2.694 3 F 17 3.020 1.450 48
Zinc - Statistical Summary Lead - Statistical Summary
Test Material
Number of 
laboratories 
(n)
Overall Mean 
( X)
Reproducibility 
Index (R)
Percent Relative 
Reproducibility 
Index (Rrel%)
Test Material
Number of 
laboratories 
(n)
Overall Mean 
( X)
Reproducibility 
Index (R)
Percent Relative 
Reproducibility 
Index (Rrel%)
A 6 0.209 0.710 340 H 17 0.178 0.609 343
H 9 0.240 0.396 165 K 19 0.269 0.397 148
I 12 0.315 0.401 127 E 19 1.033 0.693 67
J 19 1.132 1.016 90 C 19 1.939 0.670 35
L 19 1.653 0.927 56 F 19 2.234 6.701 300
G 18 3.005 1.127 38 D 19 2.283 0.659 29
D 19 3.669 1.121 31 L 19 2.860 1.909 67
B 19 9.376 1.799 19 B 19 3.247 1.210 37
K 19 19.873 1.762 9 Mean Rrel% G 18 3.816 3.084 81 Mean Rrel%
C 19 22.312 2.003 9 for X>L I 19 8.650 2.642 31 for X>L
E 19 27.733 1.968 7 J 19 17.346 10.256 59
F 19 33.600 6.026 18 A 19 24.628 13.570 55
a. Lower limits are only calculated where the element is to be analyzed near the lower end of its effective concentration range
Calculated 
Lower Limit (L) 
0.792
31%
Calculated 
Lower Limit (L) 
Calculated 
Lower Limit (L) 
n/a                    
(see note a)
8%
0.354
61%
Calculated 
Lower Limit (L) 
Calculated Lower 
Limit (L) 
0.246
110%
0.165
121%
Calculated Lower 
Limit (L) 
0.271
40%
Calculated Lower 
Limit (L) 
0.120
185%
Calculated Lower 
Limit (L) 
1.217
77%
Table 3. Statistical summaries; for each element, the samples are sorted by overall mean weight percent.
184       M E TA L  2 0 1 0  :  C H A R L E S T O N ,  S O U T H  C A R O L I N A ,  U S A
theoretical accounting for matrix effects as discussed 
above. However, these methods also perform corrections 
to the model, using spectra generated by the instrument 
in question, from reference standards of composition 
similar to that of the analyte.  The corrections can be 
performed in a variety of ways (discussion of which is 
beyond the scope of this paper), but by and large they 
attempt to account for instrument-related factors (de 
Vries and Vrebos 2002).
Empirical calibrations are derived through the 
measurement of standards that are similar to the 
unknown.  Ideally, the compositional standards 
have the same elements as the unknown, although 
the composition may be different.  Comparing each 
elemental fluorescence intensity in the standard to the 
corresponding composition and fitting a regression 
between known points, analysts can interpolate between 
known values.  The fluorescence intensity of the 
excited by incident x-rays cause enhanced fluorescence 
in the sample) through theoretically-derived 
mathematical equations. The complex calculations 
involved in this method rely on knowledge of many 
physical constants, such as mass-attenuation coefficients, 
fluorescence yields, absorption jump ratios, relative line 
intensities, absorption edges, etc. (de Vries and Vrebos 
2002). Some FP applications allow for the use of pure 
element standards to help model the spectral distribution 
of the tube output (de Viguerie et al. 2009) or to model 
transmission efficiency by polycapillary lenses (Karydas 
et al. 2008).  The use of pure element standards in this 
manner is still considered ‘standardless’ FP for the 
purposes of this study, as the standards are not used 
directly to generate scaling coefficients for analytes.
FP w/standards methods can take several forms.  As 
the name implies, they are based on mathematical 
predictions of fluorescent intensity and provide a 
Sample I  
(C934)
# of Labs 
(p)
Reference 
Value
Overall 
Median ( !")
% error 
Fe 7 0.01 0.01 0.0% * 
Ni 17 0.49 0.46 -7%
Cu 19 82.64 81.91 -1%
Zn 12 0.17 0.29 69% * 
Sn 19 8.07 8.46 5%
Sb 13 0.14 0.10 -27%
Pb 19 8.45 8.82 4%
Sample J 
(CITF B32)
# of Labs 
(p)
Certified 
Value
Overall 
Median ( !")
% error 
Fe 17 0.10 0.11 12% * 
Ni 18 1.49 1.50 1%
Cu 19 74.85 74.53 -0.4%
Zn 19 1.15 1.10 -4%
As 4 0.0056 0.0328 487% * 
Sn 19 5.92 5.78 -2%
Sb 13 0.13 0.12 -5%
Pb 19 16.10 16.76 4%
Sample K 
(MBH 31X 
B27 A) 
# of Labs 
(p)
Certified 
Value
Overall 
Median ( !")
% error 
Mn 11 0.045 0.046 2%
Fe 19 0.31 0.33 7%
Ni 12 0.042 0.054 29% * 
Cu 19 78.2 78.4 0.3%
Zn 19 19.9 19.78 -1%
As 10 0.03 0.04 39% * 
Sn 19 0.92 0.84 -9%
Sb 8 0.04 0.03 -19% * 
Pb 19 0.24 0.24 -1% * 
Bi 8 0.055 0.046 -17% * 
Sample L 
(BNF C71.34-
3)
# of Labs 
(p)
Certified 
Value
Overall 
Median ( !")
% error 
Mn 12 0.05 0.05 0.0%
Fe 19 0.29 0.25 -13%
Cu 19 87.230 86.592 -1%
Zn 19 1.55 1.62 5%
As 13 0.18 0.17 -6%
Ag 10 0.025 0.038 53% * 
Sn 19 8.20 8.43 3%
Sb 11 0.071 0.119 67% * 
Pb 19 2.47 2.76 12%
Bi 4 0.029 0.025 -13% * 
Mean error  (median > L) 5%
* certified value below L
Table 4. Comparison of certified values to group medians.
Ranking 
(major 
elements)
SCORE 
(Smajor)
Lab #
Quant 
Method
1 0.2 13 FP w/stds
2 0.6 24 FP w/stds
3 0.7 6 FP w/stds
4 0.9 19 FP w/stds
5 0.9 3 FP w/stds
6 1.2 8 FP w/stds
7 1.5 2 FP
8 1.8 23 Empirical
9 2.2 15 Empirical
10 3.2 14 Empirical
11 3.3 10 FP w/stds
12 3.7 28 Empirical
13 4.7 27 Empirical
14 5.7 1 Empirical
15 5.7 18 Empirical
16 9.1 12 Empirical
17 10.1 7 FP
18 11.1 22 Empirical
19 14.6 9 FP
Ranking 
(minor 
elements)
SCORE 
(Sminor)
Lab #
Quant 
Method
1 0.3 19 FP w/stds
2 0.7 3 FP w/stds
3 0.8 13 FP w/stds
4 0.8 6 FP w/stds
5 1.9 8 FP w/stds
6 2.7 15 Empirical
7 2.8 1 Empirical
8 2.9 28 Empirical
9 3.3 23 Empirical
10 3.9 14 Empirical
11 3.9 2 FP
12 4.5 18 Empirical
13 5.2 22 Empirical
14 8.4 9 FP
15 14.1 7 FP
Table 5. Laboratories’ Smajor and Sminor scores, ranked in order of highest 
to lowest accuracy.
A N  E V A L U AT I O N  O F  I N T E R - L A B O R AT O R Y  R E P R O D U C I B I L I T Y        185
Figure 1.  Performance scores (Smajor and Sminor) are plotted against selected laboratory characteristics.  High performance is reflected by a score close 
to zero (i.e. on the left side of the charts).
For both major and minor elements, it is very clear that 
laboratories using fundamental parameters software 
calibrated with standards (labs #3, 6, 8, 10, 13, 19, 24), 
performed consistently more accurately than laboratories 
using other methods.  Remarkably, of these seven 
laboratories, no two used the same type of instrument or 
the same brand of software. 
unknown is then compared to the calibrated regression, 
and the composition is derived. Empirical models 
typically account for absorption, secondary fluorescence 
and other matrix effects using empirically derived 
correction coefficients based on regression analysis (de 
Vries and Vrebos 2002).
186       M E TA L  2 0 1 0  :  C H A R L E S T O N ,  S O U T H  C A R O L I N A ,  U S A
among the labs using empirical methods, increasing the 
number of standards was not a guarantee of improved 
results, and the best performing laboratory of this 
group (for major elements) used only five standards per 
analysis.
Conclusions 
This study evaluates the current state of inter-laboratory 
reproducibility of quantitative XRF analysis of historic 
copper alloys based upon a representative group drawn 
from the art and archaeology community, primarily in 
the United States.  Nine members of the working group 
met for two days of intensive meetings to evaluate the 
results of the inter-laboratory study.  The conclusions and 
recommendations of this sub-group were reviewed and 
commented on by the wider group.  What follows is a 
summary of the overall findings of the working group.
Reproducibility 
The overall reproducibility of the group’s results is 
relatively poor. Even if one considers only results where 
the median result is above the calculated lower limits 
(L), the average percent relative reproducibility (Rrel%) is 
greater than 50% for all elements except Cu, Zn, and Sn.  
The ASTM standard practice stipulates that the working 
group may determine the degree of precision that may 
be considered acceptable for a given method, based 
on the context within which the results are to be used.  
However, an upper threshold for Rrel% is set at 50% above 
which the methods reproducibility must be considered 
unacceptable.  While not bound by ASTM guidelines, it 
was the consensus of the working group that the current 
reproducibility of XRF analysis of historic copper alloys 
within the art and archaeology community is, in general, 
not sufficient for any but the most broad comparisons to 
be made between laboratories. 
Two examples drawn directly from table 3 may help 
serve to illustrate the point.  Assume that a laboratory 
arrives at a result of 8.6% tin in a bronze alloy. Based 
of the current state of affairs, there is a 95% chance 
that another laboratory, measuring the same bronze, 
would arrive at a result somewhere between 4.6% and 
12.6%.  Similarly, for a result of 33% zinc in brass, the 
95% confidence interval ranges from 27% to 39%.  Also, 
considering that tin and zinc are among the elements 
with the best reproducibility in the study, the group 
agreed that concerted efforts should be made to improve 
the situation.
The reproducibility results reported here should evoke 
a strong sense of caution in those who might wish to 
publish data, compare their own data with that generated 
by other laboratories, or pursue meta-studies based on 
the work of multiple laboratories. 
The group also found that the lower limits determined 
by this study (below which reproducibility rapidly 
deteriorates) are considerably higher than could be 
wished for.  It was agreed that analyte concentrations 
below the lower limits are frequently of interest and 
significance to scientists engaged in the study of historic 
copper alloys.  
Among the laboratories using FP with standards, the 
majority had Smajor and Sminor scores that were tightly 
clustered near the perfect score of zero.  One of these 
laboratories (#10), however, appears to have performed 
noticeably less well than the others in the group 
(though their scores were still better than almost all 
other laboratories using empirical or standardless FP 
methods).  In their data reporting form, the analysts for 
this instrument noted that ‘While we did the analysis on 
this instrument, the stability of the instrument is doubtful 
and we would be cautious about reporting numbers from 
this instrument at the present time’. They also reported 
that the last full calibration of the instrument had been 
performed on the instrument more than four years 
ago.  These observations may explain the difference in 
performance.
Only seven of the 19 data sets in the study (2, 3, 6, 7, 
8, 13, 19) were able to consistently report quantitative 
results for the four major elements, plus six minor 
elements (Fe, Ni, As, Ag, Sb and Bi). Of these seven, all 
used FP or FP w/standards methodology. 
Detector Type 
Another laboratory characteristic that was evaluated 
for correlation with performance was detector type. 
The vast majority of laboratories in this study (84%) 
used silicon drift detectors (SDD) or silicon-PiN diode 
detectors (PIN). These are clearly the two dominant types 
of detectors on the market today.  Only three instruments 
in the study used lithium drifted silicon detectors (Si-Li). 
Figure 1e plots Smajor and Sminor against the three detector 
types in the study.  While more of the poorly performing 
laboratories seemed to use PIN or Si-Li detectors than 
SDDs, it is perhaps more significant to note that the 
six top ranked laboratories (#3, 6, 8, 13, 19, 24) were 
equally divided between PIN and SDD detectors.  It 
would appear then that very strong performance can 
be achieved with either of these detector types. Si-
Li detectors did not appear to perform as well as the 
other types, but with only two laboratories using these 
detectors, the results should not be given too much 
weight. 
Valid Counts per Analysis 
A surprising result of the study is that, within the range 
employed in this study, the total number of valid counts 
per analysis (vca) was not positively correlated with 
performance, either for major or minor elements.  The 
vca is given here as the product of the typical valid 
count rate per second (as reported by each laboratory 
and accounting for detector deadtime) and the number 
of seconds that the analysis was allowed to run.  For 
both major and minor elements (Figure 1c and 1d), 
there appears to be no correlation between vca and 
performance. It is interesting and perhaps instructive 
to note that if the laboratories are grouped by software/
analytical method, many of the best results for each 
group were attained with relatively low total counts, on 
the order of 300,000.  
Number of Standards 
The data also suggest that increasing the number of 
standards used for quantification does not necessarily 
improve the accuracy of results (Figure 1f).  In fact, the 
vast majority of the best performing laboratories used 20 
standards or fewer, and most used fewer than 10.  Even 
A N  E V A L U AT I O N  O F  I N T E R - L A B O R AT O R Y  R E P R O D U C I B I L I T Y        187
It was suggested by some members of the working group 
that the use of a common, open source FP software 
package[6] used in conjunction with a common and 
readily available set of reference materials could further 
improve the reproducibility of results within the group.  
Many participants expressed a desire to have a set of 
certified reference materials, replicated for the various 
institutions that wish to share data, which includes 
a range of major and trace elements appropriate for 
historic alloys.  Although a selection of available 
standards might fill a portion of this range, such a set 
would certainly require some standards to be newly 
manufactured. 
Error and Detection Limits
Reporting of error and detection limits was inconsistent 
among the participating laboratories.  Several 
laboratories did not report errors or detection limits at all. 
Many laboratories reported errors calculated from their 
software based on counting statistics.  While these values 
have meaning, they generally reflect the error associated 
with repeated analyses by the same instrument (or 
instrumental precision) rather than expected error with 
respect to the true value (instrumental accuracy). 
The laboratories that produced the most meaningful 
and reliable error values relative to true values did so by 
analyzing multiple reference materials and conducting 
a regression analysis of certified vs. calculated values.  
These laboratories used the ‘standard error’ associated 
with the regression to define meaningful confidence 
intervals relative to the estimated true value.  This 
strategy was employed both by laboratories using both 
FP w/standards and empirical methods.
Detection limits were, if anything, less consistently 
reported than errors.  Some laboratories did not report 
detection limits while others estimated them for 
selected elements based on experience with standards. 
Several participants derived their detection limits 
based on analysis of multiple reference materials with 
certified values at or near zero. A regression analysis 
was performed and a value of two or three times 
the standard error was used to estimate the nominal 
detection limit.  The consensus of the group was that this 
empirical approach provides useful results in a relatively 
straightforward manner though other means are possible 
(Ziebold 1967, Long and Winefordner 1983).
Other Suggestions
The working group suggests that, in instances where 
data are to be published or shared between laboratories, 
standard practice should include publication (perhaps 
separately) of a detailed and comprehensive reporting 
of the laboratory method along with the presentation 
of empirically derived error and detection limit values. 
In addition, it was suggested that publication of data 
include results for one or two control samples (e.g., 
reference materials analyzed during the analysis, but 
which are not part of the calibration).
In some areas, the raw data generated in this study 
has only been superficially evaluated and many more 
Quantification Method
Through this study, one common characteristic of 
higher-performing laboratories has become clear: the 
use of fundamental parameters software, calibrated with 
standards.  In comparison, all other factors examined in 
this study appear to be relatively poorly correlated with 
laboratory accuracy. The consensus of the group is that 
options should be explored for ways in which existing 
instruments that currently use empirical or standardless 
FP methods could be upgraded to use FP with standards. 
A sense of the magnitude of improvement that such a 
change, if widely adopted, might bring about can be 
gleaned from Table 6.  This presents the method’s lower 
limits (L) and the percent relative reproducibility (Rrel%) 
for all participating laboratories compared to the same 
statistics calculated based only on six laboratories using 
FP with standards (laboratory 10 was excluded from 
the group based on their self-described instrumental 
irregularities).  On average, the sub group using FP 
with standards[5] had a reduction in lower limits of 65% 
from the overall group limits, reflecting a substantial 
improvement in their ability to compare results 
when element concentrations are low.  Similarly, the 
subgroup’s Rrel% values were, on average, 55% less than 
those of the group as a whole.  While these levels may 
still leave something to be desired, it seems apparent 
that as a first step, movement toward the wider adoption 
of quantification methods utilizing FP with standards 
offers the possibility of significant improvements in 
interlaboratory reproducibility.
All Participants
Participants using 
FP with 
Standards
Iron
Calculated Lower 
Limit (L) 
0.165 0.063
Mean Rrel% for 
X>L
121% 41%
Nickel
Calculated Lower 
Limit (L) 
0.354 0.057
Mean Rrel% for 
X>L
61% 47%
Copper
Calculated Lower 
Limit (L) 
n/a n/a
Mean Rrel% for 
X>L
8% 2%
Zinc
Calculated Lower 
Limit (L) 
0.792 0.147
Mean Rrel% for 
X>L
31% 15%
Arsenic
Calculated Lower 
Limit (L) 
0.246 0.193
Mean Rrel% for 
X>L
110% 64%
Tin
Calculated Lower 
Limit (L) 
0.271 0.121
Mean Rrel% for 
X>L
40% 14%
Antimony
Calculated Lower 
Limit (L) 
0.120 0.037
Mean Rrel% for 
X>L
185% 83%
Lead
Calculated Lower 
Limit (L) 
1.217 0.226
Mean Rrel% for 
X>L
77% 27%
 
Table 6. ‘Method’s Lower Limits’ and ‘Percent Relative Reproducibility 
Indices’ as calculated for all 19 data sets compared with the same 
statistics calculated for the six top performing data sets, all using FP 
with standards software.
188       M E TA L  2 0 1 0  :  C H A R L E S T O N ,  S O U T H  C A R O L I N A ,  U S A
ASTM, ‘Designation: E 1763 – 06: Standard Guide for 
Interpretation and Use of Results from Interlaboratory 
Testing of Chemical Analysis Methods’. West 
Conshohocken: ASTM International (2006).
Chase, W.T., ‘Comparative Analysis of Archaeological 
Bronzes’, In Archaeological chemistry: a symposium 
sponsored by the Division of the History of Chemistry 
at the 165th meeting of the American Chemical Society, 
Dallas, Tex., April 9 - 10, 1973, American Chemical 
Society, Washington, DC (1974) 148-85.
de Viguerie, L., A. Duran, A. Bouquillon, V. A. Solé, 
J. Castaing, and P. Walter. ‘Quantitative X-Ray 
Fluorescence Analysis of an Egyptian Faience Pendant 
and Comparison with Pixe’, Anal Bioanal Chem 395 
((2009) 2219-25.
de Vries, J. L., and B. A. R. Vrebos. ‘Quantification of 
Infinitely Thick Specimens by Xrf Analysis’, in Handbook 
of X-Ray Spectrometry, ed. R. van Grieken and Andrzej 
Markowicz, New York: Marcel Dekker, (2002) 341-406.
Glascock, M.D., ‘An Inter-Laboratory Comparison of 
Element Compositions for Two Obsidian Sources’, IAOS 
Bulletin 23 (1999) 13-25.
Karydas, A.G., D. Anglos, and M.A. Harith. ‘Mobile 
Spectrometers for Diagnostic Micro-Analysis of 
Ancient Metal Objects’, in Metals and Museums in the 
Mediterranean: Protecting, Preserving and Interpreting, 
ed. V. Argyropoulos, Athens: TEI of Athens, (2008) 141-
77.
Long, G.L., and J.D. Winefordner. ‘Limit of Detection 
- a Closer Look at the Iupac Definition’, Analytical 
Chemistry 55 ((1983) 712A-24A.
Northover, P., and V. Rychner. ‘Bronze Analysis: 
Experience of a Comparative Programme’, In Bronze ‘96: 
L’Atelier du bronzier en Europe du XXe au VIIIe siècle 
avant Notre Ère, edited by C. Mordant, M. Pernot and 
V. Rychner, 19-40. Neuchâtel: Editions du CTHS, (1998) 
19-40
Ziebold, T.O., ‘Precision and Sensitivity in Electron 
Microprobe Analysis’, Analytical Chemistry 39 (8), (1967) 
858–61.
Author
Arlen Heginbotham received his B.A. in East Asian 
Studies from Stanford University and his M.A. in Art 
Conservation from Buffalo State College.  He is currently 
Associate Conservator of Decorative Arts and Sculpture 
at the Getty Museum where he is currently writing 
technical essays for catalogs of the Museum’s collections 
of French furniture.  Arlen’s research interests include 
the history and analysis of 17th century East Asian 
export lacquer, the history of metallurgy, the use of X-ray 
fluorescence spectroscopy as a tool for authenticating 
and interpreting gilded bronzes, microscopic and 
chemical wood identification, immunochemical analysis, 
and the history of wood dyes.
conclusions may be possible with further data analysis. 
A number of significant subjects possibly could be 
addressed using the data already collected. For instance, 
the relative advantages and disadvantages of different 
variants of the FP w/standards method; the factors 
affecting detection limits; factors affecting within-
laboratory precision; the effects of filtration; and the 
importance of careful manual inspection of spectra.  
Clearly, much work remains on the issue of inter-
laboratory reproducibility of XRF data generated for 
historical metals. As we have shown above, results 
among laboratories vary widely, not only for minor 
elements, but also for major elements. These differences 
highlight the problems associated with trying to 
compare data from multiple laboratories and the need 
for common standards and quantification approaches. 
Future research in this area should focus on addressing 
these issues.  
Endnotes
[1] The results from this instrument are designated with 
laboratory numbers 3, 7, and 22.
[2] Three of the reference materials were certified (so-
called CRMs) based on analysis by multiple laboratories 
(samples J, K, and L); one, (sample I) has no certificate of 
analysis.
[3]A complete explanation of this calculation is given in 
the ASTM E1601 sections 10.4.5 to 10.4.8.  The validity 
of the formula is contingent upon the result being larger 
than the method’s minimum standard deviation, which 
was true in every instance in this study.
[4]Using the square of ha has the dual advantages of 
making all values positive and of emphasizing the 
negative impact of occasional poor scores.  It would be 
equally valid to rank based on the absolute value of ha; 
in practice, the rank order changes very little.
[5] Six is the minimum number of participating 
laboratories required by ASTM E1601. The results 
calculated for this subgroup may therefore be considered 
as ‘valid’ based on the standard procedure.
[6] Several such software packages are available, such as 
PyMCA (European Synchrotron Radiation Facility) and 
AXIL-QXAS (International Atomic Energy Agency).
References
Hein, A., A.  Tsolakidou, I.  Iliopoulos, H.  Mommsen, 
J.  Buxeda i Garrigós, G. Montanac, and V.  Kilikogloua. 
‘Standardisation of Elemental Analytical Techniques 
Applied to Provenance Studies of Archaeological 
Ceramics: An Inter Laboratory Calibration Study’, Analyst 
127 (2002) 524-53. 
ASTM, ‘Designation: E 1601 – 98 (Reapproved 2003)
E1: Standard Practice for Conducting an Interlaboratory 
Study to Evaluate the Performance of an Analytical 
Method’. West Conshohocken: ASTM International 
(2003).