TEACHER EVALUATION: IT'S MORE COMPLICATED THAN "REFORMERS" THINK
Kenneth D. Peterson

The current rush in the U.S. to reform teacher evaluation employs poorly thought-out and untested tools and systems. The impetus, which is generally political (House, 1973, 1980; Klein, 2011; Lind, 1997; Spring, 1997), typically relies on a hybrid of two data sources as evidence of teacher quality: administrator reports and student achievement test scores. According to the abundant and long-standing, evidence- and reason-based research literature on teacher evaluation, these lines of evidence are among the most limited and least respected we have.

Teacher evaluation reformers over-simplify the necessary purposes, personnel, district infrastructure, procedures, materials, and resources. We know better than current reform efforts suggest. Educators, policy makers, and advocates would do well to read up on the subject. The hazards of unprepared application are technical, sociological, and political failures, and the likelihood that teachers merely wait out the demise of one more educational fad. The result can be lessened public understanding of, and support for, what it takes to assess teachers in world-class school systems. (A small benefit would be valuable experience and data concerning what not to do in the large-scale, high-stakes evaluation of educators.)

Three key gaps dominate current teacher evaluation reform efforts. The first is ignoring the role of sociology: while many technical data gathering and use issues are in play, teacher evaluation is more a sociological and political problem. The second missing piece of understanding and practice is how to incorporate student achievement gain. The third gap is how to think about, and judge, teacher quality data.

There are, however, clear paths in the research literature to successful teacher evaluation; but they are complicated and require disciplined attention. This discussion introduces these ideas in the form of 13 questions. Each question is followed by the Common Oversimplified answers that appear in popular media, most political agency initiatives, and even in the daily talk of schools, school boards, teachers, and administrators. The Oversimplifications are unsubstantiated--we know better. Next, each of the 13 questions receives an account of the Complex Realities of answers: directions to more accurate understandings and the next steps for development. Most Complex Realities have a list of must-read resources. These readings disclose the lack of available thought and knowledge behind many plans and advocates in the current wave of teacher evaluation reform in the U.S.
1. DON'T WE ALREADY EVALUATE TEACHERS?
Common Oversimplifications


Yes, every year. Principals, or their designates, look for continuous improvement, teacher goal and assessment evidence, and clinical supervision observations and conferences. Administrator observation, with checklists or narratives, is a common data source. Many systems include individual conferences between teacher and principal. Administrators monitor each teacher for acceptable practice and employment retention.

Complex Realities
No, we don't really evaluate teachers, except as a plan described on paper and as a kind of principal check for teacher retention. Teachers don't believe in current teacher evaluation: as described by Kauchak et al. (1985) and Wolf (1973), it depends on the person who does the evaluation. Additional realities are that we shouldn't expect much out of a no-additional-cost procedure (Danielson, 1996), that the public doesn't know or trust that teachers do a good job, and that beginning teachers are surprised by the lack of authoritative reassurance they expected from the system (Lortie, 1975; Peterson, 1990). The inattention to the sociology of personnel evaluation greatly restricts the gathering, judgment, and use of accurate and interesting data on teacher quality and effectiveness. For example, there is considerable sociological pressure in day-to-day work for all teachers to appear the same in quality and performance, which is antithetical to reform efforts (Lortie, 1975). Another example is that principals face a serious conflict of roles as Instructional Leader vs. Summative Judge. READ:

2. DON’T WE AT LEAST KNOW HOW TO EVALUATE TEACHERS BETTER THAN WE PRACTICE?
Common Oversimplifications


Teacher evaluation has come a long way. We have widely used good catalogs or frameworks of performance observations, are able to train observers to be consistent in their reports, and know how to hold administrators accountable! Also, we have much experience with the National Board for Professional Teaching Standards system of teacher assessment. No Child Left Behind gave us techniques for determining "Highly Qualified Teachers." In the past few years we have added student test scores as a way of measuring teacher effectiveness.

Complex Realities
The extensive teacher evaluation research literature is largely ignored in current reform. For example, 80 years of studies show the low reliability (thus low validity) of administrator reports of teacher performance (Peterson, 2000; Stronge, 2006), yet this is the most common data gathering practice and at the center of the vast majority of reform efforts. The limits of evaluating teachers by using student achievement test scores have been criticized both in theory (Berk, 1988; Glass, 2004) and in practice (McCaffrey et al., 2004; Schochet & Chiang, 2010). The National Board for Professional Teaching Standards certification examinations have shown considerable limits in judging awardees (Pool et al., 2001). In addition, evidence about defensibly incorporating NBPTS findings into district teacher evaluation systems is absent.
                  Other topics missing from reform accounts include: purposes of teacher evaluation, validity of lines of evidence data, methods of interpretation/judgment/decision-making, uses for findings, supportive infrastructure at the school district level, role of peer teachers, successful pilot applications before rushing into large scale application, and teacher views, including levels of teacher satisfaction with methods and systems. It is telling that the two states receiving the first round of Race to the Top federal funds immediately restructured state advisory committees on teacher evaluation rather than rely upon their existing knowledge. READ:

                  A deficiency in teacher evaluation practice and results, but NOT in the research literature, is theoretical and practical knowledge about how sociology and politics support or (most often) preclude useful, valid, and dependable teacher evaluation (Berk, 1988; DeCharms, 1968; Etzioni, 1969; House, 1973; Johnson, 1990; Lortie, 1975; Peterson, 1984). While the technical knowledge base and questions may represent 15% of sound teacher evaluation, sociological and political issues are the other 85% of what it takes to make teacher evaluation work, and be accepted by teachers themselves.
3.  WON’T TEACHER SOCIOLOGY AND POLITICS JUST “FALL INTO PLACE” AFTER TECHNICAL EVALUATION REFORM?
Common Oversimplifications


Once we focus on student learning, teachers will be able to concentrate their work on what really matters, instead of having to respond to the overwhelming range of expectations of widely diverse philosophies and interest groups. We need to find the teachers who are doing a good job, and dismiss those who are not. Reform includes standardizing evaluation procedures away from subjective principal reports.

Complex Realities
                  Teachers' hegemony is essential in their evaluation (DeCharms, 1968). Good teacher evaluation is something done by professionals, rather than done to workers. Teacher authority, leadership, participation, and individual choice all are parts of necessary participation. We'll know that the parts, and the whole, of a teacher evaluation system are indeed functioning well when surveys show "improved over old system" teacher opinions in the sweet spot of the low 80 percent range. These teacher approval levels include an opt-out choice for teachers who wish to retain "old system" administrator clinical supervision, for lone wolves, and for deficient teachers who cannot produce even one of the positive, objective data sources required of all teachers. (The few poor teachers in the system are easy to spot without a new complex evaluation system.) Teacher authority also means that an effective, complex, multi-data system of administrator evaluation be in place before the teacher system begins. All of these provisions, and more, address some of the sociological and political changes too often ignored by teacher evaluation reformers.
Research shows that principal reports have substantial deficits; it's not that principals don't know or care about teacher quality, but that they can't afford to make those distinctions in the workplace (Lortie, 1975, 2009). Classroom visit systems for gathering data are greatly restricted because of role bias and conflicts of interest (Scriven, 1991). Teacher and administrator sociological challenges can be expected in roles, relationships, status, hegemony, author vs. pawn, rewards, sanctions (punishments), leadership, authority, power, rituals, norms, subgroup formation, and communication patterns.
                  A change in how teachers are evaluated radically shakes up how teachers see themselves, work, relate to each other and administrators, and function in the larger school system. Teacher patterns of communication, goals, hierarchies, leadership, self-images, fears, rewards, sanctions, status, relationships, and roles all shift dramatically with their teacher evaluation system. The powerful, but most often invisible, forces of sociology and politics greatly shape teacher performance. To alter these forces through changes in teacher evaluation practice produces upheavals in the profession that require understanding and deliberate treatment. The most sophisticated and technically excellent teacher evaluation system possible today, brought into an unprepared district, should be expected to fail.
                  Sociology is essential to understanding how humans succeed in complex organizations. For example, there is compelling experience with changing automotive assembly line workers from low-performance, independent, individual, and ineffective cogs in the assembly of a car to effective, big-picture, productive team members (Womack et al., 1990). Analysts of this experience, which is important for understanding the fates of teacher evaluation reform (both successful and disastrous transitions), have reported that it takes a decade of drastically different planned experiences for the auto workers to change. Certainly, the education of young people is more varied, difficult, and complex; and the performance and evaluation of teachers corresponds. Yet student learning, and the teachers who foster learning, are being subjected to systems of once-over inspections (Scriven, 1981). READ:

4.  WHY EVALUATE TEACHERS?
Common Oversimplifications


Teacher evaluation makes teachers better, improves student learning, gets rid of bad teachers, and lets us find and reward the best teachers. Other countries with better schools and educational systems evaluate their teachers better than does the U.S., so we should change our practices, informed by foreign systems.

Complex Realities
Complex, accurate, and expensive teacher evaluation (Bridges, 1992; Peterson, 2000) rarely improves teachers, nor is it needed to identify deficient teachers. The research literature lacks reports where, having good baseline data, a teacher evaluation system is introduced and then shown to have (a) improved student learning, (b) changed the way teachers teach, or (c) even changed the way teachers think about their work.
                  However, there are many other very good reasons to evaluate: to document current good practices and results, reassure teachers that they are doing a needed and effective job, give the great majority of teachers added job security, reassure audiences and stakeholders, identify good teaching practices for emulation, learn more about how to hire new teachers, ward off bad evaluation practices, inform research about effective teaching and learning, and, in a few instances, improve teachers (Peterson, 2000). An entirely overlooked reason to evaluate teachers is to select volunteer individuals for key leadership and extra duty assignments (e.g., student teacher supervisor, new teacher mentor, hiring committee, textbook/curriculum committee, department or grade level chair, or teachers' organization or public relations representative) based on competitive merit (as judged by a teacher-dominated panel) determined from teacher-chosen, multiple-data-source dossiers (Peterson, 2000). Finally, schools or school districts can aggregate individual teacher data (e.g., parent surveys or professional activity) into school or district profiles. These data are useful for public relations and political advocacy (Blair, 2004; Warner, 2000).
                  Teachers have a strong vested interest in good teacher evaluation. The vast majority of teachers are delivering high-quality and needed professional performance. Public recognition and acknowledgement is an important sociological and political necessity. Teachers are evaluated every day, but with happenstance and hearsay methods. Most praiseworthy events and performances are not systematically noted. More commonly, inaccurate, haphazard, and hearsay events result in harsh and negative "assessment." The choice for educators is between good evaluation and bad evaluation. Teachers are in competition for public resources with prison and road builders and with the necessity to repair frayed social safety nets. Credible data about teachers make a difference in economics and politics.
                  The teacher evaluation practices of other countries are, in fact, not advanced beyond our own. Their "superior" educational outcomes are a result of many other factors, including greater financial support and public esteem for teachers. For example, the U.S. is one of the few industrialized countries with low teacher social prestige, and one that does not pay for teacher candidates' education and related expenses. Current directions in this country are to further reduce teacher benefits and resources.
5.  WHAT IS TEACHER QUALITY?
Common Oversimplifications


Great teachers have personality characteristics or traits that motivate students to learn! A good/great teacher certainly knows her subject matter. Good/great teachers get good/great results in academic achievement. We've got expert catalogs of observable performance indicators, e.g., twenty-two components (such as "Teacher's feedback to students is timely and of consistently high quality, and students make use of the feedback in their learning"; Danielson, 2007, p. 89) in four domains (planning and preparation, classroom environment, instruction, professional responsibilities) at four levels of performance: Unsatisfactory, Basic, Proficient, Distinguished. We certainly should be able to recognize good/great teaching by visiting the classroom; if it's not visible to an adult visitor, the students aren't getting it!

Complex Realities
There are a great many teacher performances that make a difference in student learning, for example: curriculum designer, motivator, personal relationship builder, author of relevant applications, creator of a good work environment, maker of connections with a future need or opportunity, systematic and stepwise teacher of discrete parts, holistic and global developer, or effective whole-class or individualized instructor. Actually, individual teachers are effective in their work not by doing all of these possibilities, but by different constellations of two or three of these things done very well. Often, some to many of the "important" components of a catalog of possible teacher performances are done perfunctorily, poorly, or not at all by excellent teachers, and it doesn't make a difference in student learning. A human classroom experience is dominated by what is present, and what effectively produces the learning at the moment. The absent possibilities don't make the difference. Project Follow Through was the first of many multiple regression studies of learning to suggest that Teacher Effects (the actual few contributions of a teacher to student learning) outweigh ideal Curriculum Effects by about 10 to 1.
                  Because of this variety of effectiveness, it is not likely that a rigid catalog of expected components, assessed with snapshot visits, will tell us much about teacher quality. Indeed, eighty years of empirical research evidence has shown the fallacy of evaluating teachers by looking (Peterson, 2000; Stronge, 2006). Scriven (1981) cited some of the factors making observation a weak evidence data source about teaching: the visit changes the teaching, a small number of visits yields an inaccurate sample, the personal sociological and political prejudices of the visitor dominate judgments, the visitor's personal preferences about teaching style dominate, and the reality that adult educators and young people do not think the same way. Yet administrator report remains one of the cornerstone, or even dominant, lines of evidence in teacher evaluation reform programs. READ:

6.  CAN’T WE JUST FIND AND MEASURE STUDENT LEARNING?
Common Oversimplifications


If students are learning, what else do we care about? Let's see what records we have to begin with, then beef up our testing system to a full complement of state and national systems. (Maybe we'll have to balance the student learning results with some administrator safeguards.)

Complex Realities
A major deficiency in teacher evaluation is knowing how to appropriately use student learning in the evaluation of teachers. The damaging critiques of judging teachers by using student learning include conceptual (validity) and empirical (reliability, a part of validity) problems. READ:

                  Pertinent, defensible tests are not available for all teachers, for example in the performing arts, physical education, practical trades classes, and specialties in the sciences and mathematics. Even for subject areas where good tests do exist, individual teachers may not be able to produce defensible pupil gain data (e.g., primary grades, older student specialty subjects, teachers of highly transient students, and teachers with many low-attendance pupils). These include teachers for whom three years of records are absent and those with limited student sample sizes. In addition, the purportedly best actual cases of district value-added modeling (VAM) data analysis are discouraging (McCaffrey et al., 2004; Schochet & Chiang, 2010). These latest analyses indicate that the choice of statistical modeling strategy changes individual teachers' results, as does the choice of data (specific test selection, omitted variables, confounders, missing data, controlled comparisons across schools), along with large but unknown sampling errors. Some truly superior teachers will have to be evaluated without considering pupil achievement gain data simply because they are not available.
                  Student achievement data can be used as an effective line of evidence for some, but certainly not all, teachers in a school district. A teacher evaluation system that entirely ignores pupil achievement will not be credible to audiences and stakeholders. Individual teacher choice of high quality, adjusted student gain data should be available for the teachers who can make the case for inclusion as one line of evidence in their own multiple-but-variable data source evaluation process (Peterson, 2006).
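                  To make the model-dependence concern concrete, the sketch below is the present writer's own toy illustration (invented numbers, standard Python/NumPy, no claim to represent any district's actual VAM): the same student scores yield one "teacher effect" as a raw average gain and a different one once a regression adjusts for where students started.

    # Illustrative only: two ways of turning the same toy student scores
    # into a "teacher effect."  All numbers are invented.
    import numpy as np

    # (pretest, posttest, teacher) for a handful of hypothetical students
    records = [
        (48, 61, "A"), (52, 64, "A"), (75, 83, "A"),
        (70, 84, "B"), (73, 88, "B"), (80, 93, "B"),
        (40, 55, "C"), (45, 57, "C"), (50, 66, "C"),
    ]
    pre = np.array([r[0] for r in records], dtype=float)
    post = np.array([r[1] for r in records], dtype=float)
    teachers = sorted({r[2] for r in records})

    # Model 1: raw average gain per teacher (no adjustment at all)
    raw_gain = {t: np.mean([p2 - p1 for p1, p2, tt in records if tt == t])
                for t in teachers}

    # Model 2: regress posttest on pretest plus teacher indicators, so each
    # teacher term is adjusted for where that teacher's students started
    X = np.column_stack(
        [pre] + [np.array([1.0 if r[2] == t else 0.0 for r in records])
                 for t in teachers])
    coef, *_ = np.linalg.lstsq(X, post, rcond=None)
    adjusted = dict(zip(teachers, coef[1:]))

    print("teacher   raw gain   adjusted term")
    for t in teachers:
        print(f"   {t}       {raw_gain[t]:5.1f}       {adjusted[t]:6.1f}")
    # The two columns need not order teachers the same way; adding or dropping
    # covariates, changing the test, or changing the model can reorder them again.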
7.  WHAT LINES OF TEACHER PERFORMANCE EVIDENCE DATA SHOULD WE LOOK FOR?
Common Oversimplifications


Principal observations and reports will always remain a key, or even dominant, part of teacher evaluation. Student achievement data are on the brink of another mandatory component. If we do need to broaden from that information, certainly other data sources could be developed; e.g., students, parents, and even peer teachers might be called upon.

Complex Realities
Quality teaching comes in many different forms. More accurately, good or great teaching comes from a small constellation of a few performance factors done very well (e.g., communication skills, insights, materials, interpersonal skills, experiences, subject matter expertise, persistence, hard work) that vary in combination with each teacher. In all cases these instructional strengths meet the demonstrated needs and priorities of students, using ethical practices. It shouldn't be surprising that there are many different indicators of teacher effectiveness, and that the important evidence will vary by individual teacher (Peterson, 2006).
                  It might be argued that the best evidence of teacher quality is Payoff: student learning. However, as described with Question #6, there are limits of applicability and reliability to this single important criterion or indicator of good teaching. A second level of quality concern is the Process of teaching itself: how a teacher works, his or her instructional repertory, communication skills, professional knowledge, subject matter expertise, flexibility based on student responses, and instructional competencies such as assessment. We have historically relied upon Principal Reports to gauge teacher performance with student learning and teaching process, but the Realities of earlier Questions addressed the great inadequacies of traditional teacher evaluation. The quality of the teaching Process can be reported upon by trained observers, students, and, at a greater distance, parents. Peer teachers should be limited to curriculum reviews, at a sociological and political remove (McCarthey & Peterson, 1987). Finally, although Payoff and Process data are closer to the central goals of teacher quality, teacher Preparation for quality work and student learning is a legitimate source of evaluation data. Preparation includes teacher knowledge of subject matter and pedagogy, time spent on upgrading professional topics, collaborative work with other teachers, leadership, and indirect evidence such as graduate degrees. The teacher evaluation data scene is complex, multi-leveled, at times indirect, and truly individualized.
                  Teacher data sources that each work (are valid) for some but not every teacher include: student achievement, peer review of materials, student surveys, National Board certification, teacher tests of pedagogy and/or subject matter, parent surveys, documentation of professional activity/preparation, Action Research payoffs, administrator report, documentation of contributions to underserved students, and evidence unique to the teacher. No individual data source is applicable to every teacher; mandating certain sources causes them to fail. There are data sources to avoid! (testimonials, peer visits to classrooms, peer consensus reports, graduate follow-ups, microteaching performance, self-evaluation, classroom environment, portfolio analysis). Individual teachers are the best authority on which sources best make their case for quality performance.
                  Credibility requires that although the teacher selects the particular evidence sources (e.g., parent survey, teacher test results, peer review of materials, principal report), the actual data be collected by an independent district evaluation unit. These multiple lines of evidence, presented in dossiers, require expert, teacher-dominated panels to interpret, judge, and rank (Peterson, 2000). READ:

                  The practical result for teacher evaluation is a system of multiple and variable data sources, as each teacher is called to assemble the best evidence for his or her case of competence or excellence. Rather than mandate eight or ten possible data sources for each individual, teachers can be given the task of presenting "credible data from a certain number (four each year) of a menu of possible data sources." The game changes from "match these fixed criteria" to "you know your own effectiveness best: make your best case using valid evidence." Teachers have two levels of decision: 1) choice of data sources to be gathered, and 2) confidential inspection and approval of results before disclosing them to others and using them in the evaluation system. Administrators in this approach change from sole summative judges (Scriven, 1981) to instructional leaders (with known but balanced biases).
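                  A minimal sketch of that two-decision workflow follows. It assumes, purely for illustration, a hypothetical district menu and the four-source annual requirement described above; the class and item names are the writer's invention, not a prescribed system.

    # Sketch of the two teacher decision points described above: 1) choose
    # which data sources the evaluation unit should gather, 2) confidentially
    # review the results and release only approved ones into the dossier.
    MENU = [
        "student achievement (adjusted gain)", "peer review of materials",
        "student surveys", "parent surveys", "teacher test (pedagogy/subject)",
        "documentation of professional activity", "administrator report",
        "evidence unique to teacher",
    ]
    REQUIRED_COUNT = 4  # "four each year" from the menu, per the text above

    class TeacherDossier:
        def __init__(self, name):
            self.name = name
            self.requested = []   # decision 1: sources to be gathered
            self.results = {}     # held confidentially for the teacher
            self.released = {}    # decision 2: approved for the panel

        def request_sources(self, sources):
            unknown = [s for s in sources if s not in MENU]
            if unknown:
                raise ValueError(f"not on the district menu: {unknown}")
            if len(sources) < REQUIRED_COUNT:
                raise ValueError(f"choose at least {REQUIRED_COUNT} sources")
            self.requested = list(sources)

        def receive_result(self, source, data):
            # gathered by the independent evaluation unit, seen first by the teacher
            self.results[source] = data

        def release(self, source):
            # teacher approves this result for inclusion in the dossier
            self.released[source] = self.results[source]

    dossier = TeacherDossier("hypothetical teacher")
    dossier.request_sources(MENU[:4])
    dossier.receive_result(MENU[0], "adjusted gain report")
    dossier.release(MENU[0])   # only released items go to the review panel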
8.  HOW DO WE JUDGE OR DECIDE ON TEACHER QUALITY?
Common Oversimplifications


Ask the principal, or look at her records. We could visit the room and see for ourselves. (Couldn’t we ask the students, other teachers, or parents?). Standardized results of trained observers, using uniform domains and categories of visible, low inference observations, guided by clear rubrics, allow us to categorize or even rank teachers across a broad range of classrooms and grade levels. When teacher evaluation is reformed we might be able to look at numerical rankings—either student test scores or some combination of student achievement and administrator rating. Interpretation is a simple matter of comparing scale scores or teacher rankings—a clerk, or a layperson reading a newspaper, could do it.

Complex Realities
Big gap #3 in teacher evaluation reform is failing to understand how to judge teacher performance once we have the data. To begin with, individuals are poor judges of complex teacher performance data because of role fixation (Cook & Richards, 1972), sociological/logistical conflict of interest (Lortie, 1975), personal style preference bias (Scriven, 1981), and lack of expertise. The practical answer to these problems is a teacher-dominated panel, supplemented with representatives of other stakeholders (administrators, parents, students). Panels of four teachers, two administrators, one parent, and one high school senior have demonstrated validity (Peterson, 1988). Decision-making panels, much like legal juries, are furnished with the best credible objective evidence available, teacher choice of available data sources, deliberate checks on bias, and regularized procedures. READ:
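                  The text above specifies a jury-like panel and its composition, but not a mechanical scoring rule. The sketch below is an assumed illustration only: each panelist independently ranks a few hypothetical dossiers, the ranks are combined by median, and disagreement is treated as a signal for further deliberation rather than something a clerk resolves.

    # Illustrative only: combining independent panelist rankings of dossiers.
    # The median-rank rule is an assumption for this sketch; the panel
    # composition (4 teachers, 2 administrators, 1 parent, 1 student) follows
    # the text, the scoring procedure does not.
    from statistics import median

    panel_rankings = {          # 1 = strongest case, per panelist
        "teacher 1":       {"dossier A": 1, "dossier B": 2, "dossier C": 3},
        "teacher 2":       {"dossier A": 2, "dossier B": 1, "dossier C": 3},
        "teacher 3":       {"dossier A": 1, "dossier B": 3, "dossier C": 2},
        "teacher 4":       {"dossier A": 1, "dossier B": 2, "dossier C": 3},
        "administrator 1": {"dossier A": 2, "dossier B": 1, "dossier C": 3},
        "administrator 2": {"dossier A": 1, "dossier B": 2, "dossier C": 3},
        "parent":          {"dossier A": 2, "dossier B": 1, "dossier C": 3},
        "student":         {"dossier A": 1, "dossier B": 2, "dossier C": 3},
    }

    dossiers = sorted({d for ranks in panel_rankings.values() for d in ranks})
    combined = {d: median(r[d] for r in panel_rankings.values()) for d in dossiers}

    for d in sorted(dossiers, key=lambda d: combined[d]):
        spread = (max(r[d] for r in panel_rankings.values())
                  - min(r[d] for r in panel_rankings.values()))
        print(d, "median rank:", combined[d], "spread:", spread)
    # A wide spread for a dossier flags it for panel discussion, not for a
    # mechanical tie-break.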

9.  WHAT DO WE DO WITH OUR EVALUATION RESULTS?
Common Oversimplifications


We need to know who our good teachers are, so we can acknowledge and even publicize them. Many of our best teachers are fed up with being lumped in with all of the rest: less effective, if hard-working and dedicated, teachers. Identifying the ones with consistently highest student achievement scores would motivate the others to work in similar ways, and to focus on true excellence in teaching. More public teacher evaluation would drive the worst teachers out. Certainly other occupations have visible indicators of quality, like repeat clients, success of those they serve, and how much the market is willing to pay for their services.

Complex Realities
Much thought and care must go into uses of teacher evaluation. Narrow measures, like pupil achievement tests, do not reflect the universe of teacher tasks and contributions. The sociological team nature of teaching can be disturbed by competitive ranking: it is important in the day-to-day functioning of a school to see all the performers as doing the needed job, and contributing equally to the learning achievements and future successes of the students served by the school. Much sociological damage can be done to a school system by naively rushing into an oversimplified, untested teacher evaluation system. Good alternatives are available for development over a necessary five-year (or longer) process, patiently and expertly introduced, beginning with volunteer teacher pioneers (Peterson & Peterson, 2006) whose positive experiences will drive adoption.
                  The first uses of teacher evaluation data are authoritative reassurance (Lortie, 1975) for practitioners and audiences about existing good work, results, and identification of exemplary practices. These data already are known by individual teachers as they have explored lines of evidence, like client surveys, peer reviews of materials, and various teacher and student test results. Studies show that a small percentage of teachers will use these results as formative evaluation information for improvement. Other teachers seek professional direction and improvement using their own sense of opportunity and need, independent of concrete data. The first stage of data use is to reassure and acknowledge the high percentage of currently well functioning teachers.
                  A second stage of data use can be to aggregate individual teacher results: professional activities and academic accomplishments, student achievement data, student and parent surveys, teacher test scores, action research or school improvement project payoffs, peer reviews of materials, systematic observation, and unique evidence. These aggregates present a story of the successful preparations for learning, the processes of educating, and the payoff outcomes that teachers have been responsible for as a group. This aggregated information informs staff development, but also public relations (Blair, 2004; Warner, 2000). It is essential for school districts to carry the direct evidence to the communities they serve. Truth squads (teacher-led and teacher-presented), equipped with commercial-grade print and electronic media, can take data-dense empirical evidence from a broad range of teacher performance to community meetings, civic organizations, cable TV, city councils, legislative study groups, and the media. Public relations can include routine press releases, district newsletters, websites and blogs, and parent meetings.
                  A third kind or stage of teacher evaluation is in service of promotion systems, such as those found in higher education. Teacher rank systems can serve a number of sociological functions if used in K-12 school districts. Ranks such as "Professional Teacher," "Senior Teacher," and "Master Teacher" can be created, whereby every five to seven years a teacher can voluntarily submit a dossier of multiple and variable data sources to document his or her high quality work. These promotions should carry a significant salary increase, separate from the "step increase" system. Promotion decisions should be voluntary on the teacher's part (except, perhaps, at a "tenure" time), non-quota competitive, and repeatable for those declined.
                  The fourth stage of teacher evaluation data is more specialized and limited: the appointment of teacher leaders for extra duty carrying substantial extra pay. Rigorous, teacher-dominated panels using multiple and variable data can judge teachers who volunteer for teacher leadership positions. Substantial extra pay makes sense for these service positions based on true, even competitive, demonstrated teacher merit. These extra duties include master teacher for a student teacher, grade level and department chairs, curriculum committee heads, textbook selection committees, school-wide discipline committees, teacher hiring committees, and parent liaisons. Underperforming schools should be staffed entirely by Master Teachers with extensive dossiers of competitively superior objective data showing a track record of uncommon success with students at risk of not graduating (Peterson et al., 1991). These positions can be filled by competitive dossier ranking. While competition does not support the vast majority of teachers and their functions, it can benefit a few teacher roles and individuals.
                  Merit pay, attempted in countless forms, has not demonstrated positive outcomes, and has been a documented source of problems for school districts.
10.  HOW SHOULD TEACHERS BE INVOLVED IN THEIR OWN EVALUATION?
Common Oversimplifications


Administrators do the hiring; they should do the evaluations and firings. Administrators are the only persons trained, licensed, and hired to evaluate teachers. Teachers do not see other teachers in action, and are not responsible for teacher performance. Teacher relationships prevent accurate, unbiased, and fair involvement.

Complex Realities
Teachers have perhaps the greatest immediate and long-term stakes in good teacher evaluation. Currently they don't believe in it, or in its results (Wolf, 1973). The research literature is clear that teachers should not visit and report on classrooms, aggregate their hearsay opinions to rate each other, or take a year off to become expert evaluators of others. The research literature is equally clear that teachers, if well screened to avoid conflicts of interest, can best judge the quality of each other's instructional materials, rank the quality of teacher dossiers (collections of the best objective data chosen by each teacher under review), choose the best data sources in their own case, help supervise the collection of data in a school district, and advise superintendents and school districts on teacher evaluation policy. READ:

11. HOW MUCH SHOULD WE SPEND ON TEACHER EVALUATION?
Common Oversimplifications


Schools already are strapped for money. Teacher evaluation should be an important part of the principal job. Also, we can use test results from state and other existing programs. “Schools or districts using a framework for professional practice do not face the difficulty of high cost encountered by large-scale state or national systems” (Danielson, 1996). If there are additional expenses, they will be to train teachers and principals in the new reform efforts.

Complex Realities
Effective, defensible, high quality teacher evaluation is complex, and expensive. There are two constants in the universe: love hurts and teacher evaluation costs. The costs include personnel, time, and money (Levin et al., 1987). Scriven's (1973) Comprehensive Pathways model of evaluation included Comprehensive Cost data (time, morale, dollars, installation time, displacement of previous practices), unintended Side Effects (+ and -), durability, and identification of Critical Competitor evaluation systems. Peterson (1989) reported dollar costs of $44.70 per teacher per year for an eight-data-source career ladder teacher evaluation system. Small school districts can pool resources, or join with neighboring large districts. Other studies of actual implementation are much needed.
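                  As a worked arithmetic example, using the single dollar figure just cited (Peterson, 1989) and a hypothetical district size chosen only for illustration:

    # Worked example: the Peterson (1989) figure of $44.70 per teacher per year
    # (1989 dollars, eight-data-source career ladder system) applied to a
    # hypothetical district.  Time, morale, and installation costs from
    # Scriven's Comprehensive Cost categories are not captured here.
    COST_PER_TEACHER_PER_YEAR = 44.70   # dollars, from Peterson (1989)
    TEACHERS_IN_DISTRICT = 1200         # hypothetical district size

    annual_dollar_cost = COST_PER_TEACHER_PER_YEAR * TEACHERS_IN_DISTRICT
    print(f"direct dollar cost: ${annual_dollar_cost:,.2f} per year")
    # prints: direct dollar cost: $53,640.00 per year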
                  Hoenack and Monk (1990) examined the larger economic issues associated with teacher evaluation. They concluded that teacher evaluation systems that focus on the teacher as the sole cause of student learning (production) are in error, and that close economic analysis of teacher performance discloses that individual teachers are most productive in certain specific settings of subjects, age levels, and student characteristics. The goodness of the match between the teacher and the setting is important for understanding that particular teacher, and needs to be made explicit.
A major expense for successful teacher evaluation is development of a school district infrastructure (Peterson, 2000). Essential new structures include a stakeholder Evaluation Board, which debates policy and procedures, supports and nourishes evaluation ideas and progress, and makes recommendations to the superintendent. A second new body is an Evaluation Unit, jointly headed by teacher organization and central administration representatives and staffed with part-time clerical help and expert consultants. This Unit is responsible for gathering credible data at the direction of individual teachers. As described earlier, teachers have the opportunity to call for their choices of data sources, confidentially see the results, and only then decide which data will be included in their own individual evaluation dossier. These are the two levels of teacher control. The Unit keeps confidential records, updates the best instruments and procedures, and actually implements the data gathering for the entire district. A final expense is for public relations using the considerable credible, positive, specific information about the good results of the school district's teachers and program.
                  Educator training for teacher evaluation involvement is expensive when it is not a current part of educator thought, everyday practice, understanding, and choice, that is, when we don't use current educator professional wisdom. For example, student achievement is not an explicit component of current teacher evaluation because educators know how difficult it is to incorporate. Training is much less needed for practices that are satisfying, reasonable, sensible, fair, and current; these are the clues needed to design an effective system. The amount of training necessary is a gauge of the validity (including reliability) of the proposed practices. READ:

12.  DON’T WE NEED TO INCLUDE SUCH THINGS AS STAFF DEVELOPMENT, MERIT PAY, DIFFERENTIATED STAFFING IN REFORM?
Common Oversimplifications


Staff development is integral to teacher performance and assessment: why evaluate teachers if you are not going to improve them at the same time? The need for evaluation is driven by calls for changing the way teachers are paid and assigned; this seems to be a good opportunity for rethinking teacher roles.

Complex Realities
While many topics are related to accurate and valid teacher evaluation, the techniques, procedures, materials, and expectations for evaluation itself need a primary focus for development. Then, extensions and uses can be developed. These include the many purposes for teacher evaluation, but especially exemplars for dissemination and teacher leadership based upon demonstrated merit. At this point, teacher evaluation deserves, and requires, a concentrated focus.
13.  HOW SOON CAN WE GET THIS THING GOING? 
Common Oversimplifications


Let’s get started—we need results now! Let’s form an advisory committee, have some teachers work on the criteria, hold some meetings, train the observers, and do the first round.

Complex Realities
Installation of a high quality, complex teacher evaluation system, including time for sociological and cultural change, requires a minimum of 4-5 years. As a perspective, the Womack et al. (1990) account of the similar sociological and skill changes for automotive assembly line workers recommended a decade! The tasks of innovation include learning the technical and sociological components, reading the research literature, getting experience with the data gathering requirements, building the district infrastructure, solving the political balances of co-sponsorship (administrators and teachers), and getting volunteers to try out components. A difficult change will be recognizing and addressing the sociological conditions (roles, status, hegemony, rewards, relationships, conflict resolution). The requirement for successful reform is to build a different culture for the task of enhanced teacher evaluation that eventually adds to student learning and teacher career satisfaction.
                  Genuine reform in teacher evaluation means a change from current practice into something done by professionals, not done to workers. Teachers, teacher organizations, administrators, and state and national educators all must be partners in the effort, or the reforms will not survive. A school or district teacher evaluation system should begin after a good administrator evaluation program is in place, or be begun simultaneously with one. Teacher evaluation should be introduced in stages of pilot studies (Peterson & Chenoweth, 1992; Peterson & Peterson, 2006) and with early volunteer adopters. It then becomes the duty and necessity of the reformers to protect the pioneers from known destabilizations. The teacher evaluation system itself must systematically undergo both formative and summative evaluation.
SUMMATION
                  The purpose of this discussion was to establish the complexity of teacher evaluation beyond that implied by current efforts at practical reform. Effective teacher evaluation is not too complex to be practical. However, much effort is required to understand the components and their interactions, and to carry out the requirements for valid practice (House, 1980). There is reason for renewed research in the field: one goal of systematic inquiry should be teacher evaluation that is as good as or better than that outlined here, but simpler, less expensive, and more satisfactory!
REFERENCES

Barr, A.S. & Burton, W.H. (1926). The supervision of instruction. New York: D. Appleton.
Berk, R.A. (1988). Fifty reasons why student gain does not mean teacher effectiveness. Journal of Personnel Evaluation in Education, 1 (4), 345-363.
Blair, J. (2004). Building bridges with the press: A guide for educators. Bethesda, MD: Education Week Press.
Bridges, E.M. (1992).  The incompetent teacher  (2nd ed.).  Philadelphia: Falmer Press.
Coker, H., Medley, D.M. & Soar, R.S. (1980).  How valid are expert opinions about effective teaching?  Phi Delta Kappan, 62 (2), 131-134, 149.
Danielson, C. (1996). Enhancing professional practice: A framework for teaching. Alexandria, VA: Association for Supervision and Curriculum Development.
Danielson, C. (2007). Enhancing professional practice: A framework for teaching (2nd ed.). Alexandria, VA: Association for Supervision and Curriculum Development.
DeCharms, R. (1968). Personal causation. New York: Academic Press.
Erikson, E. (1977). Toys and reasons. New York: Norton.
Etzioni, A. (1969). The semi-professions and their organization. New York: Free Press.
Frymier, J. (1987). Bureaucracy and the neutering of teachers. Phi Delta Kappan, 69 (1), 9-14.
Glass, G.V. (2004). Teacher evaluation: A policy brief. EPSL-0401-112-EPRU. Arizona State University. Tempe, AZ: Education Policy Studies Laboratory.
Hoenack, S.A., & Monk, D.H. (1990). Economic aspects of teacher evaluation. In J. Millman & L. Darling-Hammond (Eds.), The new handbook of teacher evaluation: Assessing elementary and secondary school teachers (pp. 390-402). Newbury Park, CA: Sage.
House, E. (Ed.) (1973). School evaluation: The politics and process.  Berkeley, CA: McCutchan.
House, E. (1980). Evaluating with validity. Beverly Hills, CA: Sage.
Johnson, S.M. (1990). Teachers at work: Achieving success in our schools. New York: Basic Books.
Kauchak, D., Peterson, K. & Driscoll, A. (1985).  An interview study of teachers' attitudes toward teacher evaluation practices.  Journal of Research and Development in Education, 19 (1), 32-37.
Klein, J. (2011). Scenes from the class struggle. The Atlantic, 307 (5), 66-77.
Levin, H.M., Glass, G.V. & Meister, G.R. (1987).  Cost-effectiveness of computer-assisted instruction.  Evaluation Review, 6 (1), 50-72.
Lind, M. (1997). Up from conservatism. New York: Free Press.
Lortie, D.C. (1975). Schoolteacher: A sociological study. Chicago: University of Chicago Press.
Lortie, D.C. (2009). School principal. Chicago: University of Chicago Press.
McCaffrey, D. F., Koretz, D., Lockwood, J. R., & Hamilton, L. S. (2004). The promise and peril of using value- added modeling to measure teacher effectiveness (Research Brief No. RB-9050-EDU). Santa Monica, CA: RAND.
Medley, D.M. & Coker, H. (1987). The accuracy of principals' judgments of teacher performance. Journal of Educational Research, 80 (4), 242-247.
Peterson, K. (1984).  Methodological problems in teacher evaluation.  Journal of Research and Development in Education, 17 (4), 62-70.
Peterson, K.D. (1988).  Reliability of panel judgments for promotion in a school teacher career ladder system.  Journal of Research and Development in Education, 21 (4), 95-99.
Peterson, K.D. (1989).  Costs of school teacher evaluation in a career ladder system.  Journal of Research and Development in Education, 22 (2), 30-36.
Peterson, K.D. (1990). Assistance and assessment for beginning teachers. In J. Millman & L. Darling-Hammond (eds.) The new handbook of teacher evaluation: Assessing elementary and secondary school teachers. (pp 104-115). Newbury Park, CA: Sage.
Peterson, K.D. (2000). Teacher evaluation: A comprehensive guide to new directions and practices (2nd ed.).  Thousand Oaks, CA: Corwin Press, Inc.
Peterson, K.D. (2006). Managing multiple data sources in teacher evaluation.  In Stronge, J. (ed.), Current best practices in teacher evaluation (2nd ed.) (pp. 212-232).  Thousand Oaks, CA: Corwin.
Peterson, K.D., Bennet, B., & Sherman, D. (1991). Themes of uncommonly successful teachers with at-risk students. Urban Education, 26, 176-194.
Peterson, K.D. & Chenoweth, T. (1992).  School teachers’ control and involve­ment in their own evaluation.  Journal of Personnel Evaluation in Education, 6, 177-189.
Peterson, K.D., & Peterson, C.A. (2006). Effective teacher evaluation: A guide for principals. Thousand Oaks, CA: Corwin Press.
Pool, J.E., Ellett, C.D., Schiavone, S., & Carey-Lewis, C. (2001). How valid are the National Board of Professional Teaching Standards assessments for predicting the quality of actual classroom teaching and learning? Journal of Personnel Evaluation in Education, 15 (1), 31-48.
Rosenholtz, S.J. (1989). Teachers' workplace: The social organization of schools. Research on Teaching monograph series. New York: Longman.
Schochet, P.Z., & Chiang, H.S. (2010). Error rates in measuring teacher and school performance based on student test score gains. (NCEE 2010-4004). National Center for Education Evaluation and Regional Assistance, Institute of Education Sciences. Washington, DC: U.S. Department of Education.
Scriven, M. (1973).  The evaluation of educational goals, instructional procedures and outcomes.  ERIC Document Reproduction Service ED 079 394.  Berkeley, CA: University of California.
Scriven, M. (1981). Summative teacher evaluation. In J. Millman (Ed.), Handbook of teacher evaluation (pp. 244-271). Beverly Hills: Sage.
Spring, J. (1997). Political agendas for education. Mahwah, NJ: Erlbaum.
Stodolsky, S.S. (1984). Teacher evaluation: The limits of looking. Educational Researcher, 13 (9), 11-18.
Stronge, J. (2006). Current best practices in teacher evaluation (2nd ed.).  Thousand Oaks, CA: Corwin.
Warner, C. (2000). Promoting your school: Going beyond PR (2nd ed.). Thousand Oaks, CA: Corwin Press.
Wise, A.E., Darling-Hammond, L., McLaughlin, M.W. & Bernstein, H.T. (1984).  Teacher evaluation: A study of effective practices.  Santa Monica, CA: Rand Corporation.
Wolf, R. (1973).  How teachers feel toward evaluation.  In E. House (Ed.), School evaluation: The politics and process (pp. 156-168).  Berkeley, CA: McCutchan.
Womack, J.P., Jones, D.T., & Roos, D. (1990). The machine that changed the world. New York: Free Press.